AUTOMATED DETERMINATION OF SUBLANGUAGE SYNTACTIC USAGE 
Ralph Grbhman and Ngo Thanh Nhan 
Courant Institute of Mathematical Sciences 
New York University 
New York, NY 10012 
Elalne Marsh 
Navy Center for Applied 1~se, arch in ~ Intel~ 
Naval ~ Laboratory 
Wx,~hinm~, DC 20375 
Lynel~ Hirxehnum 
Research and Development Division 
System Development Corpmation / A Burroughs Company 
Paofi, PA 19301 
Abstract 
Sublanguages _differ from each other, and from the "stan- 
dard Ian~age, in their syntactic, semantic, and 
discourse vrolx:rties. Understanding these differences is 
important'if -we are to improve our ability to process 
these sublanguages. We have developed a sen~.'- 
automatic ~ure for identifying sublangnage syntact/c 
usage from a sample of text in the sublanguage..We 
describe the results of applying this procedure to taree 
text samples: two sets of medical documents and a set of 
equipment failure me~ages. 
Introduction 
b A sub~age.is th.e f.oan.of ..natron." ~a~ y a oommumty ot s~ts m atm~mg a resmctea 
domain. Sublanguages differ from each other, and tron}. 
the "standard language, in their syntactic, ~antic, anti 
discourse properties. We describe ~ some rec~.t work on (-senii-)automatically determining the.syntactic_ 
properties of several sublangnages. This work m part ot 
a larger effort aimed at improving the techniques for 
parsing sublanguages. 
If we esamine a variety of scientific and technical 
sublanguages, we will encounter most of the constructs of 
the standard language, plus a number of syntactic exten- 
sions. For example, report" sublantgnag ~, such as are 
used in medical s||mmarles and eqmpment failure sum- 
maries, include both full sentences and a number of ~ag- merit forms \[Marsh 1983\]. Specific sublanguages differ 
in their usage of these syntactic constructs \[Kittredge 
1982, Lehrberger 1982\]. 
Identifying these differences is important in under- 
standing how sublanguages differ from the Language as a 
whole. It also has immediate practical benefits, since it 
allows us to trim our grammar tO fit the specific sub- 
language we are processing. This can significantly speed 
up the analysis process and bl~.k some spurious parses 
which wouldbe obtained with a grammar of Overly broad 
coverage. 
Determining Syntaai¢ Usage 
Unf .ort~natcly, a~l..uirin~ the data .about ,yn~'c usage can De very te~ous, masmuca ~ st reqmres .me 
analysis of hundreds (or even thousands) of s~. fence., for 
each new sublangnage to.be proces____~i. We nave mere- 
fore chosen to automate this process. 
We are fortunate to have available to us a very 
broad coverage English grammar, the Linguistic.St~ing Grammar \[S~gor 1981\], which hp been ex~. d~ 
include the sentence fragn~n_ ts of certain medical aria 
cquilnnent failure rcixn'm \[Marsh 1983\]. The gram,--," 
consmts of a context-~r=, component a.ugmehtc~l .by pr~ural restrictions which capture v_.anous synt.t.t ~ 
and sublanguage _semantic cons_tt'aints. "l\]~e con~- . 
component is stated in terms ot lgra.mmatical camgones 
such as noun, tensed verb, and ad~:tive. 
To be. gin .the analysis proceSS, a sample .mrpus is usmg this gr~,-=-,: .The me of generanm par~s_ 
m reviewed manually to eliminate incorrect ~. x ne 
remalningparses are then fed to a program which .cc~ts 
-- for each parse tree and .cumulatively for ~ entb'e me 
.- the number of times that each production m me 
context-free component of the grammar was applied in building the tr¢~. This yields a "trimmed" context-fr¢~ 
grammar for. the sublangua!~e (consLsting ~. ~osc pro- 
ductions usea one or more tunes), atong w~m zrequency 
information on the various productions. 
This process was initially applied to text. sampl~ 
from two Sublanguages. The .fi~s. t is a set o.x s~ pauent 
documents (including patient his.tm'y., eTam,n.ation, .and 
plan of treatment). The second m a set ot electrical 
equipment failure relxals called "CASREPs', a class of 
operational report used by the U. S. Navy \[Froscher 
1983\]. The parse file for the patient documents had 
correct parses for 236 sentences (and sentence frag- ments); the file for the CASREPS had correct parses tor 
123 sentences. We have recently applied the process, to a 
third text sample, drawn from a subIanguage very stmflar 
to the first: a set of five hospital discharge summaries , 
Which include patient histories, e~nmlnnt\[ous, and sum- 
maries of the murse of treatment in the hospital. This 
last sample included correct parses for 310 sentences. 
96 
Results 
The trimmed grarnrtl~l~ ~du~ from thc three 
sublanguage text samples were of comparable size. The 
grammar produced from the first set of patient docu- 
menU; col~tained 129 non-termlnal symbols and 248 pro- 
ductions; the grnmmar from the second set (the 
"discharge summaries") Was Slightly \]~trger, with 134 
non-termin~ds and 282 productions. The grammar for the CASREP sublanguage was slightly smaller, with 124 
non-terminal~ and 220 productions (this is probably a 
reflection of the smaller size of the CASR text sam- 
ple). These figures compare with 255 non-termlnal sym- 
bols and 744 productions in the "medical records" gram- 
mar used by the New York University Linguistic String 
Pro~=t (the "medical records" grammar iS the Lingttistic 
String Project English Grammar with extensions for sen- 
tencc fragments and other, sublanguagc specific, con- structs, and with a few options deleted). 
Figures 1 and 2 show the cumulative growth in the 
size of the I~"immed grammars for the three sublanguages 
as a function of the number of sentences in the sample. 
In Ftgure 1 we plot the number of non-term/hal symbols in the grammar as a function of sample size; in Figure 2, 
the number of productions in the ~ as a function 
of sample size. Note that the curves for the two medical 
sublanguages (curves A and B) have pretty much fiat- 
tcned out toward the end, indicating that, by that point, the trimmed grnmm~tr COVe'S a V~"y lar~ fra~on of the 
sentences in the sublanguage. (Some of the jumps in the 
growth curves for the medical grAmmarS refleet the ~vi- 
sion of the patient documents into sections (history, pl3y- sical exam, lab tests, etc.) with different syntactic charac- 
teristics. For the first few documents, wl3en a new see- 
tion bedim, constructs are encountered which did not 
appear m prior sections, thus producing a jump in the 
c11rve.) 
The sublanguage gramma~ arc substantially smaller than the full English grammar, reflecting the more lim- 
itcd range of modifiers and complements in these sub- 
languages. While the full grammar has 67 options for 
sentence object, the sublanguage grammars have substan- 
tially restricted mages: each of the three sublanguage 
grammars has only 14 object options. Further, the gram- 
mars greatly overlap, so that the three grammars com- 
bined contain only 20 different object options. While 
sentential complements of nouns are available in the full grammar, there arc no i~tanc~ of such a:~\[lstrllcfions in 
either medical sublanguage, aad only one instance in the 
CASREP sublanguage. The range of modifiers iS also 
much restricted ia the sublangu=age grammars as com- 
pared to the full grammar. 15 options for sentential 
modifiers are available in the full grammar. These are 
restricted to 9 in the first medical sample, 11 in the 
second, and 8 in the equipment failure sublangua~e. Similarly, the full English gr~mmnr has 21 options tor 
right modifiers of nouns; the sublanguage gr~mma_~S had fewer, 11 in the first medical sumple, I0 m" the second, 
and 7 in the CASREP sublanguage. Here the sub- 
language grammars overlap almost completely: only 12 different right modifiers of noun are represented in the 
three grammars combined. 
Among the options occurring in all the sublanguage 
grammars, their relative frequency varies ao~o~ding to 
the domain of the text. For example, the frequency of 
prepositional phrases as right modifiers of nouns (meas; 
urea as instances per sentence or sentence fragment) was 
0.36 and 0.46 for the two medical samples, as compared 
to 0.77 for the CASREPs. More striking was the fre- 
quency of noun phrases with nouns as modifiers of other 
nouns: 0.20 and 0.32 for the two medical ~mples, 
versus 0.80 for the CASREPs. 
We reparsed some of the sentences from the first set 
of medical documents with the trimmed grammar and, as 
~, o.bserved a considerable " speed-up. The 
t.mgumuc ~mng rarser uses a p.op-uown pa.~mg algo- 
rithm with., .ba~track~" g. A,~Ldingly , for short, simple sentences which require little backtr~.king there was only 
a small gain in processing speed (about 25%). For long, 
complex sentences, however, which require extensive 
backtracking, the speed-up (by roughly a factor of 3) was 
approximately proportional to the reduction in the 
number of productions. In addition, the ~fyequcncy of bad parses decreased slightly (by <3%) with the 
l~mmed y.mm.r (because some of the bad parses 
involved syntactic constructs which did not appear m any 
o~,,~ect parse in the sublanguage sample). 
Discussion 
As natural .lan..~,uage interfaces become more 
mature, their portability .- the ability to move an inter- face to a new domain and sublenguage -. is becoming 
increasingly important. At 8 minimllm, portability 
requires us to isolate the domain dependent information in a natural \]aDgua.~.e system \[C~OSZ 1983, Gri~hman 
1983\]. A more ambitious goal m to provide a discovery 
procedure for this information -. a procedure Wl~eh can 
determine the domain dependent information from sam- 
ple texts in the sublanguage. The tcchnklUeS described 
above provide a partial, semi-automatic discovery pro- cedure for the syntactic usages of a sublangua~.* By 
applying .these .t~gues to a small sublan~ sample, 
we ~ adapt a broad-coverage grammar tO the syntax of 
a particular sublanguage. Sub~.quont text from this sub- language caa then be i~xessed more efficiently. 
We are currently extending this work in two direc- 
tions. For sentences with two or more parses which ~ 
atisfy .both the syntactic and the sublanguage selectional 
semanu.'c) constraints, we intena to try using the/re- 
Cency information ga~ered for productions to select, a 
invol "ving the more frequent syntactic constructs.** 
Second, we are using a s~milAr approach to develop a 
discovery procedure for sublanguage selectional patterns. 
We are collecting, from the same sublanguage samples, 
statistics on the frequency of co-occurrence of particular 
sublan .guage (semantic) classes in subjeet.vedy.ob~:ct and 
host-adjunct relations, and are using this data as input to 
* Partial, because it cannot identify new extensions to the base gramme; semi-automatic, because the 
parses produced with the broad-coverage grammar • must be manually reviewed. 
* Some small experiments of this type have been 
one with a Japanese ~ \[Naga 0 1982\] with 1|mired success. Becat~ of the v~_ differ~t na- 
ture of the grammar, however, it is not dear whether this lass any implications for our experi- 
ments. 
97 
the grammar's sublanguage selectional restrictions. 
Acknowledgemeat 
This material is based upon work supported by the Nalional Science Foundation under Grants No. MCS-82- 
02373 and MCS-82-02397. 
Referenem 
\[Frmcher 1983\] Froscher, J.; Grishmau, R.; Bachenko, 
J.; Marsh, E. "A linguistically motivated approach to automated analysis of military messages." To appear in 
Proc. 1983 Conf. on Artificial Intelligence, Rochester, MI, 
April 1983. 
\[Grlslnnan 1983\] Gfishman, R.; ~, L.; Fried. man, C. "Isolating domain dependencies in natural 
language interface__. Proc. Conf. Applied Natural l~nguage Processing, 46-53, Assn. for Computational 
Linguistics, 1983. 
\[Greu 1963\] Grosz, B. "TEAM: a transportable natural-language interface 
system," Proc. Conf. Applied 
Natural Language Processing, 39-45, Assn. for Comlmta- 
fional IAnguhflm, 1983. 
\[Kittredge 1982\] Kim-edge, 11. "Variation and homo- 
geneity of sublauguages3 In Sublanguage: Jmdies of 
language in reslricted semantic domains, ed. R. Kittredge 
and J. Lehrberger. Berlin & New York: Walter de 
Gruyter; 1982. 
on and the concept of sublanguage. In $ublan~a&e: 
sl~lies of language in restricted semantic domains, ed. R. Kittredge and J. Lehrberger. Berlin & New York: 
Walter de Gruyter; 1982. 
\[Marsh 1983\] Marsh, E.. "Utilizing domain-specific 
information for processing compact text." Proc. Conf. 
ied Namra\[ Lansuage Processing, 99-103, Assn. for putational Linguistics, 1983. 
\[Nape 1982\] Nagao, M.; Nakamura, J. "A parser which learns the application order of rewriting rules." 
Proc. COLING 82, 253-258. 
\[Sager 1981\] Sager, N. Natural Lansuage lnform~on Pro- 
ceasing. Reading, MA: Addlson-Wesley; 1981. 
98 
130 
120 
110 
100 
80 
80 
90 
60 
50 
40 
30 0 
SENTENCES VS. NJ~N-TERMINRL SYHBBLS 
• ' • ' " ' ' , ' , " , • , • , • , • I • v "r 2- 
Y 
A 
, i . , . . . , I / , i . i , i , i , ) , i . 
z° ~lo 80 oo I oo 12o 14o 18o 18o zoo zzo z4o 
x 
Figure 1. Growth in thc size of the gr~mm.r 
as a function of the size of the text sample. X 
= the number of sentences (and sentence frag- 
ments) in the text samplc; ~" = the number of 
non-terminal symbols m the context-free com- 
ponent of thc ~'ammar. 
Graph A: first set of patient documents 
Graph B: second set of pat/cnt documcnts 
("discharge s-~-,-,'ics") 
Graph C: e~, uipment failure messages 
140 
130 
1:)0 
110 
100 
gO 
8O 
90 
30 
SENTENCES VS. NON-TERMINRL 5YHBBLS 
f / 
B 
SO , , • , , . . l , . . . . . . , . . . , . , . , . , . , . , . 
0 ZO 40 60 80 100 IZO 140 130 180 ZOO ZZO 240 Z60 ZSO 300 3ZO 
X 
1so 
12o 
11o 
SENTENCES VS. N~N-TERMINRL SYMBOLS 
• e • , , l • , • l , , • , , , , , , , , 
J / 
J 
. /--' 
/ 
, , v , 
lOO 
80 
)-- 80 
70 
80 
3o C 
4O 
• * , , • I s I , i , : * f , i , i • * , , * , • 
30 0 10 ZO 30 40 30 60 70 30 ~0 100 110 120 1~0 
X 
99 
30O 
200 
ZSO 
SENTENCES VS. PR°IDUCTI°JNS 
• , . \[ • , . , • . . , . , . , . , • , , . . ,.._/7 
A 
J 
..... ,,, , ~, ............... 
~0 40 6 100 12Q 140 1150 180 ZOO ZZO Z~O 
X 
Figure 2. Growth in the size of the grammar 
as a fuaction of the size of thc text sample. X 
= the number of sentences (and sentence frag- 
ments) in the text sample; Y = the number of 
productions in the context-free component of 
the grammar. 
Graph A: first set of patient documents 
Graph B: second set of pati_e~.t documents ("discharge s.~,-,,~cs ) 
Graph C: e~,. ,uipment failure messages (cAs~,Ps-) 
220 
20O 
180 
2~ 
220 
2(30 
=,- 100 
180 
Z 40 
SENTENCES VS. PRODUCTI°'INS 
", 1 , i • i • , • a , i • J , , , i , i , J . i • J . , • i , 
260 
240 
220 
200 
180 
16G 
140 
120 
lOG 
80 
80 
40 
J 
t2Q 
80 
60 , * , J . i • i , i , i . i . i . , , . , i , , , B , . . . . 
O ZO 40 60 OO 100 120 1"i0 150 150 ZOO 220 Z~O ZSO ZSO 30O 32O 
X 
SENTENCES VS. PRgDUCTI°INS 
160 
140 
100 
O0 / 
C 6O 
ZOo 10 ZO 30 40 O0 ~0 tO0 ;10 IZO 
X 
i00 
