Towards a Syntactic Account of Punctuation 
Bernard Jones 
Centre for Cognitive Science 
University of Edinburgh 
2 Buccleuch Place 
Edinburgh EH8 9LW 
United Kingdom 
bernie@cogsci.ed.ac.uk
Abstract 
Little notice has been taken of punctu- 
ation in the field of natural language 
processing, chiefly due to the lack 
of any coherent theory on which to 
base implementations. Some work has 
been carried out concerning punctu- 
ation and parsing, but much of it 
seems to have been rather ad-hoc and 
performance-motivated. This paper 
describes the first step towards the 
construction of a theoretically-motivated 
account of punctuation. Parsed corpora 
are processed to extract punctuation 
patterns, which are then checked and 
generalised to a small set of General 
Punctuation Rules. Their usage is 
discussed, and suggestions are made for 
possible methods of including punctu- 
ation information in grammars. 
1 Introduction 
Hitherto, the field of punctuation has been almost
completely ignored within Natural Language 
Processing, with perhaps the single exception 
of the sentence-final full-stop (period). The 
reason for this non-treatment has been the lack 
of any coherent theory of punctuation on which
a computational treatment could be based. As 
a result, most contemporary systems simply strip 
out punctuation in input text, and do not put any 
marks into generated texts. 
Intuitively, this seems very wrong, since punctuation
is such an integral part of many written
languages. If text in the real world (a newspaper, 
for example) were to appear without any punctu- 
ation marks, it would appear very stilted, 
ambiguous or infantile. Therefore it is likely that 
any computational system that ignores these extra 
textual cues will suffer a degradation in perfor- 
mance, or at the very least a great restriction in 
the class of linguistic data it is able to process. 
Several studies have already shown the 
potential for using punctuation within NLP. Dale 
(1991) has shown the positive benefits of using 
punctuation in the fields of discourse structure
and semantics, suggesting that it can be used to 
indicate degrees of rhetorical balance and aggre- 
gation between juxtaposed elements, and also that 
in certain cases a punctuation mark can determine 
the rhetorical relations that hold between two 
elements. 
In the field of syntax Jones (1994) has shown, 
through a comparison of the performance of a 
grammar that uses punctuation and one which 
does not, that for the more complex sentences 
of real language, parsing with a punctuated 
grammar yields around two orders of magnitude 
fewer parses than parsing with an unpunctuated
grammar, and that additionally the punctuated 
parses better reflect the linguistic structure of 
the sentences. Briscoe and Carroll (1995) extend 
this work to show the real contribution that 
usage of punctuation can make to the syntactic 
analysis of text. They also point out some funda- 
mental problems of the approach adopted by 
Jones (1994). 
If, based on the conclusions of these studies, 
we are to include punctuation in NLP systems 
it is necessary to have some theory upon which 
a treatment can be based. Thus far, the only
account available is that of Nunberg (1990), which,
although it provides a useful basis for a theory, is
a little too vague to be used as the basis of any
implementation. In addition, the basic implementation
of Nunberg's punctuation linguistics seems
untenable, certainly on a computational level,
since it stipulates that punctuation phenomena
should be treated on a separate level to the lexical
words in the sentence (Jones, 1994). It is also the
case that Nunberg's treatment of punctuation is
too prescriptive to account for, or permit, some
phenomena that occur in real language (Jones,
1995).
Therefore it is necessary to develop a new
theory of punctuation that is suitable for computational
implementation. Work has already been
carried out on the variety of punctuation marks
and their interaction (Jones, 1995), showing
that whilst the set of symbols that we conventionally
regard as punctuation (point punctuation,
quotation and parenthetical symbols)
accounts for the majority of punctuation in the
written language (and therefore could be implemented
in a standardised way), there is another
set of more unusual symbols, usually with a higher
semantic content, which tend to be specific to the
corpus in which they occur and therefore are less
suited to a standardised treatment. This study
also shows that the average number of punctuation
symbols to be expected in a sentence of
English is four, thus reinforcing the argument
for the inclusion of punctuation in language
processing systems.
The next step towards the development of a
theory of punctuation is the study of the interaction
of punctuation and the lexical items it
separates, in particular the way that punctuation
will integrate into grammars and syntax.
The major problem of the evaluatory studies
(Dale (1991), Jones (1994), and to a far lesser
extent Briscoe & Carroll (1995)) was that their
coverage and use of punctuation was rather poor,
being necessarily based on human intuitions and
possible idiosyncrasies. What is needed therefore
is a proper investigation into the syntactic roles
that punctuation symbols can play, and a formalisation
of these into instructions for the inclusion
of punctuation in NL grammars.
2 Data Collection 
The best data sources are parsed corpora. Using 
these ensures a wide range of language is covered; 
since they are hand-parsed or checked the parse
will be (nominally) correct; and since there are
many parsers/editors no individual's intuitions or
idiosyncrasies will dominate. The set of parsed
corpora is sadly very small but still sufficient to
yield useful results.
The corpus chosen was the Dow Jones section of
the Penn Treebank (size: 1.95 million words). The
bracketings were analysed so that each 'node' that
has a punctuation mark as its immediate daughter
is reported, with its other daughters abbreviated
to their categories, as in (1)-(3).
(1) [NP [NP the following] : ] ==> [NP = NP :]
(2) [S [PP In Edinburgh] , [S ...] ] ==> [S = PP , S]
(3) [NP [NP Bob] , [NP ...] , ] ==> [NP = NP , NP , ]
In this fashion each sentence was broken down
into a set of such category-patterns, resulting in a
set of different category-patterns for each punctuation
symbol. These sets were then processed by
hand to extract the underlying rule patterns from
the raw category-patterns, since the raw patterns include
instances of serial repetition (4) and lexical 'break-through'
in cases where phrases are not marked in
the original corpus (5).
(4) [NP = NP , NP , NP , NP or NP]
(5) [NP = each project , or activity PP]
These underlying rule-patterns represent all the 
ways that punctuation behaves in this corpus, and 
are good indicators of how the punctuation marks 
might behave in the rest of language. In the next 
sections we try to generalise these rule-patterns 
and discuss their possible implementation. 
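The extraction step described in this section can be illustrated with a minimal sketch. It is not the procedure actually used on the Treebank: the simplified square-bracket input format, the function names `parse` and `category_patterns`, and the `PUNCT` set are all assumptions made for illustration. Each node that has a punctuation mark as an immediate daughter is reported, with phrasal daughters abbreviated to their categories; unbracketed words are passed through unchanged, which reproduces the lexical 'break-through' seen in (5).

```python
import re

# Assumed set of point-punctuation marks to look for (illustrative only).
PUNCT = {",", ":", ";", "-", "."}

def parse(text):
    """Parse a bracketing like '[S [PP In Edinburgh] , [S ...]]' into (label, children)."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume '['
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word or a punctuation mark
                pos += 1
        pos += 1                      # consume ']'
        return (label, children)

    return node()

def category_patterns(tree):
    """Return one 'MOTHER = daughters' pattern per node with a punctuation daughter."""
    label, children = tree
    patterns = []
    if any(isinstance(c, str) and c in PUNCT for c in children):
        # Phrasal daughters are abbreviated to their category; words stay as they are.
        parts = [c[0] if isinstance(c, tuple) else c for c in children]
        patterns.append(f"{label} = {' '.join(parts)}")
    for c in children:
        if isinstance(c, tuple):
            patterns.extend(category_patterns(c))
    return patterns

print(category_patterns(parse("[S [PP In Edinburgh] , [S ...]]")))
```

Applied to the bracketings in (1)-(3), this yields the patterns shown on the right-hand side of those examples.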
3 Experimental Results 
There were 12,700 unique category-patterns
extracted from the corpus for the five most
common marks of point punctuation, ranging
from 9,320 for the comma to 425 for the dash.
These rules were then reduced to just 137 underlying
rule-patterns for the colon, semicolon, dash,
comma and full-stop.
Even some of these underlying rule-patterns,
however, are questionable since their incidence
is very low (maybe once in the whole corpus)
or their form is so linguistically strange as to
call into doubt their correctness (possibly idiosyncratic
mis-parses), as in (6).
(6) [ADVP = PP , NP]
Therefore all the patterns were checked against
the original corpus to recover the original
sentences. The sentences for patterns with
low incidence and those whose correctness
was questionable were carefully examined to
determine whether there was any justification for
a particular rule-pattern, given the content of the
sentence.
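The selection of low-incidence patterns for manual checking could be prepared along the following lines. This is only a sketch: the function name `flag_for_review`, the threshold `max_count`, and the toy input are assumptions, and the judgement of whether a flagged pattern is linguistically justified remains a manual step, as described above.

```python
from collections import Counter

def flag_for_review(observed_patterns, max_count=1):
    """Return the patterns whose corpus incidence is at or below max_count."""
    counts = Counter(observed_patterns)
    return sorted(p for p, n in counts.items() if n <= max_count)

# Invented toy data, not corpus figures.
observed = ["NP = NP : NP"] * 3 + ["ADVP = PP , NP"]
print(flag_for_review(observed))
```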
Taking the subset of rules relating to the colon,
for example, shows that there are 27 underlying
rule patterns from the original analysis, as shown
in table 1.

NP=NP:NP     NP=S:NP      VP=VP:VP     S=S:S
NP=NP:PP     NP=PP:NP     VP=VP:NP     S=S:NP
NP=NP:VP     PP=PP:PP     VP=VP:PP     S=S:
NP=NP:S      PP=PP:       VP=VP:S      S=NP:S
NP=NP:       PP=AS IN:    VP=VP:       S=NP:VP
NP=NP:ADJP   PP=TO:       S=VP:NP      S=PP:S
NP=VP:NP                  S=VP:S       S=IJ:S
Table 1: Underlying colon rule-patterns

By examining all (or a representative subset)
of the sentences in the original corpus that yield
these underlying rule-patterns, the majority of
them can be eliminated. The only real underlying
patterns are those in table 2.

NP=NP:NP     NP=NP:S      NP=NP:PP     NP=NP:ADJP
PP=PP:PP     PP=P:NP      VP=V:S       VP=V:NP
S=S:S        S=S:NP       S=PP:S       S=VPING:NP
Table 2: Remaining colon rule-patterns
The rest of the rule-patterns were eliminated 
because they represented idiosyncratic brack- 
etings and category assignments in the original 
corpus, and so were covered by other rules. It 
should also be noted that some incorrect category 
assignments were made at the earlier data analysis 
stages, which explains why several of the revised 
rules have non-phrasal-level left-most daughters. 
Here are some examples of the inappropriate rule 
patterns. 
• S=NP:S -- inappropriate because the mother
category should really be NP. Instances
of this pattern in the corpus (7) are no
different to instances of the similar rule with
an NP mother and the pattern is more suited
to a nominal interpretation. The problem
has arisen in this case through confusion of
sentential and top categories in the grammar.
Almost all items in the corpus are marked
as sentences, although not all fulfil that
grammatical role.
(7) Another concern: the funds' share prices 
tend to swing more than the broader 
market. 
• NP=NP:VP -- all the verb phrases for this
pattern were imperative ones, which can
legitimately act as sentences (8). Therefore
instances of this rule application are covered
by the NP=NP:S rule.
(8) Meanwhile stations are fuming because 
many of them say, the show's distributor, 
Viacom Inc, is giving an ultimatum: either 
sign new long-term commitments to buy 
future episodes or risk losing "Cosby" to a 
competitor. 
• VP=VP:NP -- a case of misbracketing (9).
The colon-expansion should not be bracketed
as an adjunct to the VP but rather as an
adjunct to the whole sentence in order to
make linguistic sense.
(9) The following were neither barred nor
suspended: Stephanie Veselich Enright,
[...] ; Stuart Lane Russel, [...] ; Devon
Nilson Dahl, [...]
It should be noted, however, that whilst all the 
twelve patterns in table 2 are valid, not all of them 
are normal colon expansions. There are seven 
exceptions. Significantly though, all the rule- 
patterns are in agreement with the description of 
colon use that can be found in publishers' style 
guides (Jarvie, 1992), which even cite the excep- 
tional cases found here. 
• PP=P:NP -- uses the colon merely to
introduce a conjunctive structure (10),
possibly one which is structurally separated
from the preceding sentence fragment in, say,
an itemised list and that has quite linguistically
complex items.
(10) We like climbing up: rock, trees and cliffs.
• VP=V:NP & VP=V:S -- are similarly used
to introduce conjunctive lists where the verb
subcategorises for sentences or noun phrases,
and also in certain writing styles to introduce
direct speech (11).
(11) They said: "We went to the party."
• NP=NP:NP -- the only instance in the whole
corpus of this pattern was a book title (12).
It is unlikely to be used more frequently in any
other circumstances.
(12) "Big Red Confidential: Inside Nebraska
Football"
• PP=PP:PP -- possibly the most productive of
the excepted rules, this rule pattern provides
only for a colon expansion containing a clarifying
PP re-using the same preposition (13).
Its use is very infrequent, though.
(13) [...] spoke specifically of a third way:
of having produced a historic synthesis of
socialism and capitalism.

NP=NP:NP     NP=NP:ADJP   PP=PP:PP     VP=V:NP      S=S:NP       S=PP:S
NP=NP:S      NP=NP:PP     PP=P:NP      VP=V:S       S=S:S        S=VPING:NP

NP=NP;NP     S=S;S        VP=VP;VP     PP=PP;PP

S=PP.        S=INTJ.      S=S.         S=ADJP.      S=ADVP.      S=NP.
S=VP.

VP=VP-VP-    PP=PP-PP-    NP=NP-NP-    NP=NP-VP-    NP=NP-S-     NP=NP-PP-
S=S-S-       ADJP=ADJP-ADJP-           S=S-PP-      S=S-NP-

ADJP=ADJP,   ADJP=ADJP,ADJP   ADJP=ADJP,ADVP   ADJP=ADJP,PP   ADJP=ADJP,S
VP=VP,       VP=VP,VP     VP=VP,PP     VP=VP,S      VP=VP,NP     VP=VP,ADVP
ADVP=ADVP,   ADVP=ADVP,ADVP   ADVP=ADVP,SBAR   VP=ADVP,VP     VP=VP,ADJP
NP=NP,       NP=NP,NP     NP=NP,S      NP=NP,VP     NP=NP,PP     NP=NP,ADJP
NP=NP,ADVP   NP=ADVP,NP   NP=INTJ,NP   NP=PP,NP     NP=ADJP,NP   NP=VP,NP
S=S,         S=S,S        S=S,NP       S=S,VP       S=S,PP       S=S,ADVP
S=S,INTJ     S=INTJ,S     S=ADVP,S     S=PP,S       S=NP,S       S=VP,S
S=CONJ,S     PP=PP,       PP=PP,PP     PP=PP,ADVP   PP=ADVP,PP
Table 3: Processed underlying punctuation rule patterns

• S=PP:S -- an exception since the mother
category is not really a sentence (14). It is
more likely to be an item in a list that is introduced
by a phrase such as "Views were aired
on the following matters:". The frequency of
this pattern in the corpus is an artifact of its
journalistic nature.
(14) On China's turmoil: "It is a very unhappy
scene," he said.
• S=VPING:NP -- a unique rule pattern whose
mother is not strictly speaking a grammatical
sentence (15). There are two solutions: the
initial verbal phrase can be treated either as
a sentence with a null subject or as a gerund
noun-phrase.
(15) Also spurring the move to cloth: diaper
covers with Velcro fasteners that eliminate
the need for safety pins.
By repeating this pattern elimination for all the
rules, the number of rule patterns was reduced to
just 79, and more than half of these related to the
comma. The rules are shown in table 3. Since
some of the patterns only apply in particular,
exceptional cases, the number of 'standard' rules
is reduced even further. Also, since many valid
rule-patterns occur infrequently in the corpus,
there exists the possibility that there are further
valid infrequent punctuation patterns that do
not occur in the corpus. Whilst some of these
may be hypothesized, and incorporated into a
formalisation, other more obscure patterns may
be missed, and so the guidelines postulated in this
paper are not necessarily exhaustive for the whole
language.
4 Formalism
If the exceptional cases are ignored, it is relatively
straightforward to postulate some generalisations
about the use of the various punctuation marks.
Colon expansions seem only to occur in
descriptive contexts. Thus their mother category
can be either NP or S, descriptive categories, rather
than the active VP or locative PP. The mother
category of a colon expansion is always the same
as the category to which the adjunct is attached
(the left-most daughter) and this is even true
of many of the exceptional rule patterns if the
constraint is relaxed to allow the daughter to have
a lower bar-level. The phrase contained within the
colon-expansion (right-most daughter) must also
be descriptive, but can be ADJP in addition to
NP and S. (Although there was no rule pattern
found in the corpus that had an adjectival colon
expansion with a sentential mother-category, it
is certainly possible to imagine such a sentence
(16).) Therefore (17) can be postulated as a
general colon-expansion rule.
(16) The cat lay there quietly: relaxed and warm.
(17) 𝒳 = 𝒳 : {NP | S | ADJP}    𝒳:{NP, S}
The rule generalisation for semicolons is very
simple, since the semicolon only separates similar
items (18). The possibility exists that this rule
may apply to further categories such as adjectival
and adverbial, although instances of this were not
found in the corpus.
(18) 𝒮 = 𝒮 ; 𝒮    𝒮:{NP, S, VP, PP}
The generalisation for the full-stop is also
straightforward, since it applies to all categories.
The only problem is that it is not necessarily
suitable for all the resulting structures to be
referred to as sentences. The mothers should
really all be top-category, since the full-stop is
used to signal the end of a text-unit. Thus the
generalisation in (19) is the most appropriate.
(19) 𝒯 = * .
The dash interpolation is the first punctuation 
mark for which generalisation becomes slightly 
complicated. There appear to be two general 
rules, which overlap slightly. The first (20) simply 
states that a dash interpolation can contain an 
identical category to the phrase it follows. The 
second rule (21) extends this rule when applied 
to the two descriptive categories, so that a wider
range of categories is permitted within the
interpolation. Again, one of the rule-patterns
permitted by (21) does not actually occur in the
corpus, but does seem plausible. Note that since
these rules incorporate a final dash, they will rely
on Nunberg's (1990) principle of point absorption
to delete the final dash if necessary.
(20) 𝒟 = 𝒟 - 𝒟 -    𝒟:{NP, S, VP, PP, ADJP}
(21) ℰ = ℰ - {NP | S | VP | PP} -    ℰ:{NP, S}
Commas have the most complicated set of
rule-patterns. The generalisation seems to be that
any combination of phrasal categories is permitted, so
long as one of the daughter categories is identical
to the mother category (22a&b). The restriction
on this, and the reason why there are fewer rule-patterns
for categories such as PP, ADJP and
ADVP, is that rules with the same daughters but
more 'powerful' mother categories (e.g. sentential
vs. adverbial) seem to be able to block the application
of the 'less powerful' rules.
(22) a. 𝒞 = 𝒞 , *    𝒞:{NP, S, VP, PP, ADJP, ADVP}
     b. 𝒞 = * , 𝒞
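As a rough illustration, the generalisations (17)-(22) can be read as a licensing check on a mother category and its daughters. The sketch below is one possible reading, not an implementation from this work: the function name `licensed`, the list encoding of daughters, the label "T" for the top category, and the treatment of * as exactly one category are all assumptions. The blocking of 'less powerful' comma rules by 'more powerful' mothers is not modelled.

```python
PHRASAL = {"NP", "S", "VP", "PP", "ADJP", "ADVP"}

def licensed(mother, daughters):
    """Check a local tree against rules (17)-(22).

    daughters is a list of categories and marks, e.g. ["NP", ":", "S"].
    """
    d = daughters
    # (17) colon: X = X : {NP | S | ADJP}, X in {NP, S}
    if len(d) == 3 and d[1] == ":":
        return mother in {"NP", "S"} and d[0] == mother and d[2] in {"NP", "S", "ADJP"}
    # (18) semicolon: S = S ; S, over {NP, S, VP, PP}
    if len(d) == 3 and d[1] == ";":
        return mother in {"NP", "S", "VP", "PP"} and d[0] == mother == d[2]
    # (19) full-stop: T = * .  (any category, top-category mother; "T" is an assumed label)
    if len(d) == 2 and d[1] == ".":
        return mother == "T"
    # (20) and (21) dash interpolation, including the final dash
    if len(d) == 4 and d[1] == "-" and d[3] == "-":
        if d[0] != mother:
            return False
        same = mother in {"NP", "S", "VP", "PP", "ADJP"} and d[2] == mother   # (20)
        wider = mother in {"NP", "S"} and d[2] in {"NP", "S", "VP", "PP"}     # (21)
        return same or wider
    # (22a&b) comma: one daughter must match the mother
    if len(d) == 3 and d[1] == ",":
        return mother in PHRASAL and mother in (d[0], d[2])
    return False

print(licensed("NP", ["NP", ":", "S"]), licensed("VP", ["VP", ":", "NP"]))
```

Under this reading the excepted colon patterns, such as VP=V:NP, are rejected, which matches the decision above to set the exceptional cases aside before generalising.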
As an extension to these results of the analysis,
it is relatively straightforward to postulate the
following simple rules (23-26), even though
the punctuation symbols they refer to are not
explicitly searched for in this analysis, and they
can in fact be verified in corpora.
• For any sort of quotation-marks (excluding
so-called "Victorian Quotation"). Note
also that Nunberg's principle of quote-transposition
is still necessary if this rule is
to remain in its current form.
(23) 𝒬 = " 𝒬 "    𝒬:*
• For stress-markers
(24) 𝒵 = 𝒵 ?    𝒵:*
(25) 𝒴 = 𝒴 !    𝒴:*
(26) 𝒲 = 𝒲 ...    𝒲:*
5 Implementation Methodology 
The issue now arises of the best way to integrate 
punctuation into a NL grammar. There are three 
existing hypotheses to choose from. The theory 
of Nunberg (1990) is that punctuation should be 
treated in a 'text grammar' on a separate level to 
the lexical grammar. However, as pointed out by 
Jones (1994), it is difficult to see how this would 
be feasible in practice and there is little linguistic 
or psychological motivation for such a separation 
of lexical text and punctuation.
Therefore Jones (1994) fully integrates punctuation
and lexical grammar, and in effect treats
punctuation marks as clitics on words, introducing
additional features into normal syntactic
rules (27). Briscoe and Carroll (1995), however,
point out that this makes it hard to extract an
independent text grammar or introduce modular
semantics. Therefore their grammar keeps the
punctuation and part-of-speech rules separate,
but still allows them to be applied in an interleaved
manner, in effect finding the happy medium
between the two extreme approaches. Hence,
additionally, their rules include the punctuation
marks as distinct entities, rather than cliticising
them, although they still require extra features to
ensure proper application of the rules (28).
(27) np[st 𝒮] → np[st c] np[st 𝒮] ¹
(28) V2[WH -, INV -] →
H2[WH -, FIN +, -ta] +pco ² V1[VFORM ING]
The most appropriate method would seem to 
be a combination of the two integrated methods 
above, combining their modularity, flexibility and 
power. Thus the Generalised Punctuation Rules 
obtained above could be encoded into a normal 
syntactic grammar to add punctuation capabilities.
However, this will almost certainly result
in overgeneration of parses, as the rules are still
too flexible: they accurately describe syntactic
situations where punctuation can occur, but fail
to place any constraints upon those situations.
Hence some further theoretical work seems to be
required to constrain the applicability of these
rules.
The main location for punctuation marks is 
likely to be with phrasal-level items, whether the 
marks occur before a particular phrasal item or 
after it. Punctuation does not seem to occur 
at levels below the phrasal, with one exception: 
punctuation is allowed to occur at any level in 
the context of coordination. Thus (29) represents
legal use of punctuation adjoining a phrasal item
since it occurs adjacent to the ADJP within the
NP. However, in (30) there is no phrasal item for
the punctuation to attach to, and so its use is
unsanctioned. Conjunctive punctuation use can
be seen in (31), where although occurring below
the level of NP, the punctuation is legal because
of its conjunctive context.
(29) The green, more turquoise actually, bicycle ...
(30) * The, bicycle is a joy to ride.
(31) The shark, whale and dolphin can all swim.
¹ 𝒮 represents a variable
² +pco represents a comma
To generalise, then, punctuation seems to have
adjunctive and conjunctive functions, and the
theoretical formalisation of these functions will
form a good method of constraining the parses
produced with the Generalised Rules above.
6 Conclusion 
We have seen that by extracting punctuation 
patterns from a corpus it has been possible 
to postulate a small number of generalisations 
for punctuation rules within NL grammars. A 
suitable methodology for applying punctuation to
existing grammars has also been suggested. Since
many of the rule patterns seem to have a very low
frequency of occurrence it may also be useful to
collect such frequencies and use them in the rule
generalisations to attach probabilities to various
rule expansions. We have also seen that the rule
patterns we extracted from the corpora agreed to
a large extent with the descriptions of punctuation
use found in publishers' style-guides, suggesting
that reference to these may be useful.
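One simple way to attach probabilities to rule expansions, offered here only as a sketch, is a relative-frequency estimate per mother category. The function name `rule_probabilities`, the dictionary layout, and the counts in the usage line are invented for illustration; no such frequencies are reported in this paper.

```python
from collections import defaultdict

def rule_probabilities(pattern_counts):
    """Map (mother, expansion) counts to relative frequencies per mother category."""
    totals = defaultdict(int)
    for (mother, _expansion), n in pattern_counts.items():
        totals[mother] += n
    return {rule: n / totals[rule[0]] for rule, n in pattern_counts.items()}

# Invented counts, for illustration only.
counts = {("NP", "NP : NP"): 1, ("NP", "NP : S"): 3, ("S", "S : S"): 4}
print(rule_probabilities(counts))
```

Estimates of this kind would let a parser prefer frequent expansions while still admitting rare but valid ones, such as the book-title use of NP=NP:NP.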
What is needed now is a thorough testing and
evaluation of the suggestions made in this paper,
both against punctuation patterns from other
corpora and in parsing novel material, to maybe
suggest better generalisations. Then the next step
towards a theory of punctuation can be carried
out, namely the analysis of punctuation for its
semantic function and content.
My regards to the international academic and
research community in the field of Computational
Linguistics: thank-you, and good-bye!
References 
Edward Briscoe. 1994. Parsing (with) Punctuation,
and Shallow Syntactic Constraints on
Part-of-Speech Sequences. RXRC Grenoble
Laboratory, Technical Report.
Edward Briscoe and John Carroll. 1995. Developing
and Evaluating a Probabilistic LR Parser
of Part-of-Speech and Punctuation Labels. In
Proceedings of the ACL/SIGPARSE 4th International
Workshop on Parsing Technologies,
pages 48-58, Prague.
Robert Dale. 1991. Exploring the Role of Punctuation
in the Signalling of Discourse Structure.
In Proceedings of the Workshop on Text Representation
and Domain Modelling, pages 110-120,
Technical University Berlin.
Gordon Jarvie. 1992. Chambers Punctuation
Guide. W & R Chambers Ltd., Edinburgh, UK.
Bernard Jones. 1994.
Exploring the Role of Punctuation in Parsing
Real Text. In Proceedings of the 15th International
Conference on Computational Linguistics
(COLING-94), pages 421-425, Kyoto, Japan,
August.
Bernard Jones. 1995. Exploring the Variety
and Use of Punctuation. In Proceedings of
the 17th Annual Cognitive Science Conference,
pages 619-624, Pittsburgh, Pennsylvania, July.
Geoffrey Nunberg. 1990. The Linguistics of
Punctuation. CSLI Lecture Notes 18, Stanford,
California.
Acknowledgements 
This work was carried out under Research Award
R00429334171 from the (UK) Economic and
Social Research Council.
Thanks for
instructive and helpful comments and suggestions
to Alexander Holt, Henry Thompson, Ted Briscoe
and anonymous reviewers.
