<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2102"> <Title>Towards a Syntactic Account of Punctuation</Title> <Section position="4" start_page="604" end_page="604" type="metho"> <SectionTitle> 2 Data Collection </SectionTitle> <Paragraph position="0"> The best data sources are parsed corpora. Using these ensures a wide range of language is covered; since they are hand-parsed or checked, the parse will be (nominally) correct; and since there are many parsers/editors, no individual's intuitions or idiosyncrasies will dominate. The set of parsed corpora is sadly very small but still sufficient to yield useful results.</Paragraph> <Paragraph position="1"> The corpus chosen was the Dow Jones section of the Penn Treebank (size: 1.95 million words). The bracketings were analysed so that each 'node' that has a punctuation mark as its immediate daughter is reported, with its other daughters abbreviated to their categories, as in (1)-(3).</Paragraph> <Paragraph position="2"> (1) [NP [NP the following] : ] ⇒ [NP = NP :] (2) [S [PP In Edinburgh] , [S ...]] ⇒ [S = PP , S] (3) [NP [NP Bob] , [NP ...] , ] ⇒ [NP = NP , NP , ] In this fashion each sentence was broken down into a set of such category-patterns, resulting in a set of different category-patterns for each punctuation symbol. These sets were then processed by hand to extract the underlying rule-patterns from the raw category-patterns, since these will include instances of serial repetition (4) and lexical 'breakthrough' in cases where phrases are not marked in the original corpus (5).</Paragraph> <Paragraph position="3"> (4) [NP = NP , NP , NP , NP or NP] (5) [NP = each project , or activity PP] These underlying rule-patterns represent all the ways that punctuation behaves in this corpus, and are good indicators of how the punctuation marks might behave in the rest of language. 
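As an illustration only, the reporting step described in this section can be sketched in Python. The nested-list tree encoding, the PUNCT set and the function names category_pattern and punctuation_patterns are assumptions made for this sketch; they are not taken from the original analysis, which worked directly on the Penn Treebank bracketings.

```python
# Sketch only: the tree encoding and every name here are illustrative assumptions.
# A node is a list [category, daughter, ...]; a daughter is a node or a token string.
PUNCT = {":", ";", ",", ".", "-"}


def category_pattern(node):
    """Return 'MOTHER = D1 D2 ...', abbreviating bracketed daughters to their categories."""
    mother, daughters = node[0], node[1:]
    parts = [d[0] if isinstance(d, list) else d for d in daughters]
    return mother + " = " + " ".join(parts)


def punctuation_patterns(node):
    """Collect a pattern for every node that has a punctuation mark as an immediate daughter."""
    found = []
    daughters = node[1:]
    if any(d in PUNCT for d in daughters if isinstance(d, str)):
        found.append(category_pattern(node))
    for d in daughters:
        if isinstance(d, list):
            # Recurse so that punctuation deeper in the tree is also reported.
            found.extend(punctuation_patterns(d))
    return found


# Example (3): [NP [NP Bob] , [NP ...] , ]
tree = ["NP", ["NP", "Bob"], ",", ["NP", "..."], ","]
print(punctuation_patterns(tree))  # ['NP = NP , NP ,']
```

In this sketch any daughter that is not bracketed is copied into the pattern as a bare word, which is one way the lexical 'breakthrough' seen in (5) could arise; collapsing serial repetition as in (4) is left to the manual step.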
In the next sections we try to generalise these rule-patterns and discuss their possible implementation.</Paragraph> </Section> <Section position="5" start_page="604" end_page="604" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"> There were 12,700 unique category-patterns extracted from the corpus for the five most common marks of point punctuation, ranging from 9,320 for the comma to 425 for the dash.</Paragraph> <Paragraph position="1"> These rules were then reduced to just 137 underlying rule-patterns for the colon, semicolon, dash, comma, full-stop.</Paragraph> <Paragraph position="2"> Even some of these underlying rule-patterns, however, are questionable since their incidence is very low (maybe once in the whole corpus) or their form is so linguistically strange as to call into doubt their correctness (possibly idiosyncratic mis-parses), as in (6).</Paragraph> </Section> <Section position="6" start_page="604" end_page="606" type="metho"> <SectionTitle> (6) [ADVP = PP , NP] </SectionTitle> <Paragraph position="0"> Therefore all the patterns were checked against the original corpus to recover the original sentences. The sentences for patterns with low incidence and those whose correctness was questionable were carefully examined to determine whether there was any justification for a particular rule-pattern, given the content of the sentence.</Paragraph> <Paragraph position="1"> Taking the subset of rules relating to the colon, for example, shows that there are 27 underlying rule patterns from the original analysis, as shown in table 1.</Paragraph> <Paragraph position="2"> By examining all (or a representative subset) of the sentences in the original corpus that yield these underlying rule-patterns, the majority of them can be eliminated. 
The only real underlying patterns are those in table 2.</Paragraph> <Paragraph position="3"> The rest of the rule-patterns were eliminated because they represented idiosyncratic bracketings and category assignments in the original corpus, and so were covered by other rules. It should also be noted that some incorrect category assignments were made at the earlier data analysis stages, which explains why several of the revised rules have non-phrasal-level left-most daughters. Here are some examples of the inappropriate rule patterns.</Paragraph> <Paragraph position="4"> * S=NP:S -- inappropriate because the mother category should really be NP. Instances of this pattern in the corpus (7) are no different to instances of the similar rule with an NP mother, and the pattern is more suited to a nominal interpretation. The problem has arisen in this case through confusion of sentential and top categories in the grammar.</Paragraph> <Paragraph position="5"> Almost all items in the corpus are marked as sentences, although not all fulfil that grammatical role.</Paragraph> <Paragraph position="6"> (7) Another concern: the funds' share prices tend to swing more than the broader market.</Paragraph> <Paragraph position="7"> * NP=NP:VP -- all the verb phrases for this pattern were imperative ones, which can legitimately act as sentences (8). Therefore instances of this rule application are covered by the NP=NP:S rule.</Paragraph> <Paragraph position="8"> (8) Meanwhile stations are fuming because many of them say, the show's distributor, Viacom Inc, is giving an ultimatum: either sign new long-term commitments to buy future episodes or risk losing &quot;Cosby&quot; to a competitor.</Paragraph> <Paragraph position="9"> * VP=VP:NP -- a case of misbracketing (9). 
The colon-expansion should not be bracketed as an adjunct to the VP but rather as an adjunct to the whole sentence in order to make linguistic sense.</Paragraph> <Paragraph position="10"> (9) The following were neither barred nor suspended: Stephanie Veselich Enright,</Paragraph> <Paragraph position="12"> It should be noted, however, that whilst all the twelve patterns in table 2 are valid, not all of them are normal colon expansions. There are seven exceptions. Significantly though, all the rule-patterns are in agreement with the description of colon use that can be found in publishers' style guides (Jarvie, 1992), which even cite the exceptional cases found here.</Paragraph> <Paragraph position="13"> * PP=P:NP -- uses the colon merely to introduce a conjunctive structure (10), possibly one which is structurally separated from the preceding sentence fragment in, say, an itemised list, and that has quite linguistically complex items.</Paragraph> <Paragraph position="14"> (10) We like climbing up: rock, trees and cliffs.</Paragraph> <Paragraph position="16"> to introduce conjunctive lists where the verb subcategorises for sentences or noun phrases, and also in certain writing styles to introduce direct speech (11).</Paragraph> <Paragraph position="17"> (11) They said: &quot;We went to the party.&quot; * NP=NP:NP -- the only instance in the whole corpus of this pattern was a book title (12).</Paragraph> <Paragraph position="18"> It is unlikely to be used more frequently in any other circumstances.</Paragraph> <Paragraph position="19"> (12) &quot;Big Red Confidential: Inside Nebraska Football&quot; * PP=PP:PP -- possibly the most productive of the excepted rules, this rule pattern provides only for a colon expansion containing a clarifying PP re-using the same preposition (13). 
Its use is very infrequent, though.</Paragraph> <Paragraph position="20"> (13) [...] spoke specifically of a third way: of having produced a historic synthesis of socialism and capitalism.</Paragraph> <Paragraph position="21"> category is not really a sentence (14). It is more likely to be an item in a list that is introduced by a phrase such as &quot;Views were aired on the following matters:&quot;. The frequency of this pattern in the corpus is an artifact of its journalistic nature.</Paragraph> <Paragraph position="22"> (14) On China's turmoil: &quot;It is a very unhappy scene,&quot; he said.</Paragraph> <Paragraph position="23"> * S=VPING:NP -- a unique rule pattern whose mother is not strictly speaking a grammatical sentence (15). There are two solutions: the initial verbal phrase can be treated either as a sentence with a null subject or as a gerund noun-phrase.</Paragraph> <Paragraph position="24"> (15) Also spurring the move to cloth: diaper covers with velcro fasteners that eliminate the need for safety pins.</Paragraph> <Paragraph position="25"> By repeating this pattern elimination for all the rules, the number of rule patterns was reduced to just 79, and more than half of these related to the comma. The rules are shown in table 3. Since some of the patterns only apply in particular, exceptional cases, the number of 'standard' rules is reduced even further. Also, since many valid rule-patterns occur infrequently in the corpus, there exists the possibility that there are further valid infrequent punctuation patterns that do not occur in the corpus. 
Whilst some of these may be hypothesized, and incorporated into a formalisation, other more obscure patterns may be missed, and so the guidelines postulated in this paper are not necessarily exhaustive for the whole language.</Paragraph> </Section> <Section position="7" start_page="606" end_page="607" type="metho"> <SectionTitle> 4 Formalism </SectionTitle> <Paragraph position="0"> If the exceptional cases are ignored, it is relatively straightforward to postulate some generalisations about the use of the various punctuation marks.</Paragraph> <Paragraph position="1"> Colon expansions seem only to occur in descriptive contexts. Thus their mother category can be either NP or S, descriptive categories, rather than the active VP or locative PP. The mother category of a colon expansion is always the same as the category to which the adjunct is attached (the left-most daughter), and this is even true of many of the exceptional rule patterns if the constraint is relaxed to allow the daughter to have a lower bar-level. The phrase contained within the colon-expansion (right-most daughter) must also be descriptive, but can be ADJP in addition to NP and S. (Although there was no rule pattern found in the corpus that had an adjectival colon expansion with a sentential mother-category, it is certainly possible to imagine such a sentence (16).) Therefore (17) can be postulated as a general colon-expansion rule.</Paragraph> <Paragraph position="2"> (16) The cat lay there quietly: relaxed and warm. (17) X = X : {NP | S | ADJP}   X:{NP, S} The rule generalisation for semicolons is very simple, since the semicolon only separates similar items (18). 
The possibility exists that this rule may apply to further categories such as adjectival and adverbial, although instances of this were not found in the corpus.</Paragraph> <Paragraph position="3"> (18) S = S ; S   S:{NP, S, VP, PP} The generalisation for the full-stop is also straightforward, since it applies to all categories. The only problem is that it is not necessarily suitable for all the resulting structures to be referred to as sentences. The mothers should really all be top-category, since the full-stop is used to signal the end of a text-unit. Thus the generalisation in (19) is the most appropriate.</Paragraph> <Paragraph position="4"> (19) T = * .</Paragraph> <Paragraph position="5"> The dash interpolation is the first punctuation mark for which generalisation becomes slightly complicated. There appear to be two general rules, which overlap slightly. The first (20) simply states that a dash interpolation can contain an identical category to the phrase it follows. The second rule (21) extends this rule when applied to the two descriptive categories, so that a wider range of categories are permitted within the interpolation; again, one of the rule-patterns permitted by (21) does not actually occur in the corpus, but does seem plausible. Note that since these rules incorporate a final dash, they will rely on Nunberg's (1990) principle of point absorption to delete the final dash if necessary.</Paragraph> <Paragraph position="6"> (20) D = D - D -   D:{NP, S, VP, PP, ADJP} (21) E = E - {NP | S | VP | PP} -   E:{NP, S} The commas have the most complicated set of rule-patterns. The generalisation seems to be that any combination of phrasal categories is OK, so long as one of the daughter categories is identical to the mother category (22a and b). The restriction on this, and the reason why there are fewer rule-patterns for categories such as PP, ADJP and ADVP, is that rules with the same daughters but more 'powerful' mother categories (e.g. 
sentential vs. adverbial) seem to be able to block the application of the 'less powerful' rules.</Paragraph> <Paragraph position="7"> (22) a. C = C , *   C:{NP, S, VP, PP, ADJP, ADVP}   b. C = * , C As an extension to these results of the analysis, it is relatively straightforward to postulate the following simple rules (23-26), even though the punctuation symbols they refer to are not explicitly searched for in this analysis, and they can in fact be verified in corpora.</Paragraph> <Paragraph position="8"> * For any sort of quotation-marks (excluding so-called &quot;Victorian Quotation&quot;). Note also that Nunberg's principle of quote-transposition is still necessary if this rule is to remain in its current form.</Paragraph> </Section> <Section position="8" start_page="607" end_page="608" type="metho"> <SectionTitle> 5 Implementation Methodology </SectionTitle> <Paragraph position="0"> The issue now arises of the best way to integrate punctuation into an NL grammar. There are three existing hypotheses to choose from. The theory of Nunberg (1990) is that punctuation should be treated in a 'text grammar' on a separate level to the lexical grammar. However, as pointed out by Jones (1994), it is difficult to see how this would be feasible in practice and there is little linguistic or psychological motivation for such a separation of lexical text and punctuation.</Paragraph> <Paragraph position="1"> Therefore Jones (1994) fully integrates punctuation and lexical grammar, and in effect treats punctuation marks as clitics on words, introducing additional features into normal syntactic rules (27). Briscoe and Carroll (1995), however, point out that this makes it hard to extract an independent text grammar or introduce modular semantics. Therefore their grammar keeps the punctuation and part-of-speech rules separate, but still allows them to be applied in an interleaved manner, in effect finding the happy medium between the two extreme approaches. 
Hence, additionally, their rules include the punctuation marks as distinct entities, rather than cliticising them, although they still require extra features to ensure proper application of the rules (28).</Paragraph> <Paragraph position="2"> (27) np[st S] np[st c] np[st S]</Paragraph> <Paragraph position="4"> The most appropriate method would seem to be a combination of the two integrated methods above, combining their modularity, flexibility and power. Thus the Generalised Punctuation Rules obtained above could be encoded into a normal syntactic grammar to add punctuation capabilities. However, this will almost certainly result in overgeneration of parses, as the rules are still too flexible: they accurately describe syntactic situations where punctuation can occur, but fail to place any constraints upon those situations.</Paragraph> <Paragraph position="5"> Hence some further theoretical work seems to be required to constrain the applicability of these rules.</Paragraph> <Paragraph position="6"> The main location for punctuation marks is likely to be with phrasal-level items, whether the marks occur before a particular phrasal item or after it. Punctuation does not seem to occur at levels below the phrasal, with one exception: punctuation is allowed to occur at any level in the context of coordination. Thus (29) represents a legal use of punctuation adjoining a phrasal item, since it occurs adjacent to the ADJP within the NP. However, in (30) there is no phrasal item for the punctuation to attach to, and so its use is unsanctioned. Conjunctive punctuation use can be seen in (31), where, although occurring below the level of NP, the punctuation is legal because of its conjunctive context. [Footnote 1: g represents a variable.]</Paragraph> <Paragraph position="7"> (29) The green, more turquoise actually, bicycle ... 
(30) * The, bicycle is a joy to ride.</Paragraph> <Paragraph position="8"> (31) The shark, whale and dolphin can all swim.</Paragraph> <Paragraph position="9"> To generalise, then, punctuation seems to have adjunctive and conjunctive functions, and the theoretical formalisation of these functions will form a good method of constraining the parses produced with the Generalised Rules above.</Paragraph> </Section> </Paper>