UTILIZING DOMAIN-SPECIFIC INFORMATION FOR PROCESSING COMPACT TEXT
Elaine Marsh
Linguistic String Project
New York University
New York, New York
ABSTRACT 
This paper identifies the types of sentence
fragments found in the text of two domains: medical
records and Navy equipment status messages. The
fragment types are related to full sentence forms
on the basis of the elements which were regularly
deleted. A breakdown of the fragment types and
their distributions in the two domains is
presented. An approach to reconstructing the
semantic class of deleted elements in the medical
records is proposed which is based on the semantic
patterns recognized in the domain.
I. INTRODUCTION 
A large amount of natural language input,
whether to text processing or question-answering
systems, consists of shortened sentence forms,
sentence "fragments". Sentence fragments are found
in informal technical communications, messages,
headlines, and in telegraphic communications.
Occurrences are characterized by their brevity and
informational nature. In all of these, if people
are not restricted to using complete, grammatical
sentences, as they are in formal writing
situations, they tend to leave out the parts of the
sentence which they believe the reader will be able
to reconstruct. This is especially true if the
writer deals with a specialized subject matter
where the facts are to be used by others in the
same field.
Several approaches to such "ill-formed"
natural language input have been followed. The
LIFER system [Hendrix, 1977; Hendrix, et al., 1978]
and the PLANES system [Waltz, 1978] both account
for fragments in procedural terms; they do not
require the user to enumerate the types of
fragments which will be accepted. The Linguistic
String Project has characterized the regularly
occurring ungrammatical constructions and made them
part of the parsing grammar [Anderson, et al.,
1975; Hirschman and Sager, 1982]. Kwasny and
Sondheimer (1981) have used error-handling
procedures to relate the ill-formed input of
sentence fragments to well-formed structures.
While these approaches differ in the way they
determine the structure of the fragments and the
deleted material, for the most part they rely
heavily, at some point, on the recognition of
semantic word-classes. The purpose of this paper
is to describe the syntactic characteristics of
sentence fragments and to illustrate how the
domain-specific information embodied in the
cooccurrence patterns of the semantic word-classes
of a domain can be utilized as a powerful tool for
processing a body of compact text, i.e. text that
contains a large percentage of sentence fragments.
II. IDENTIFICATION OF FRAGMENT TYPES 
The New York University Linguistic String
Project has developed a computer program to analyze
compact text in specialized subject areas using a
general parsing program and an English grammar
augmented by procedures specific to the subject
areas. In recent years the system has been
tailored for computer analysis of free-text medical
records, which are characterized by numerous
sentence fragments. In the computer analysis and
processing of the medical records, relatively few
types of sentence fragments sufficed to describe
the shortened forms, although such fragments
comprised fully 49% of the natural language input
[Marsh and Sager, 1982]. Fragment types can be
related to full forms on the basis of the elements
which are regularly deleted. Elements deleted from
the fragments are from one or more of the syntactic
positions: subject, tense, verb, object. The six
fragment types identified in the set of medical
records are shown in Table 1 as types I-VI.
A feature of fragment types that is not
immediately obvious is the fact that they are
already known in the full grammar as parts of
fuller constructions. The fragment types reflect
deletions found in syntactically distinguished
positions within full sentences, as illustrated in
Table 2. For example, in normal English, a sentence
that contains tense and the verb be can occur as
the object of verbs like find (e.g. She found that
the sentence was [illegible]). In the same
environment, as object of find, a reduced sentence
can occur in which the tense and verb be have been
omitted, as in fragment type I (e.g. She found the
sentence [illegible]). In the same manner, other
reduced forms reflected in fragment types also
represent constructions generally found as parts of
regular English sentences.
The fact that the fragment types can be
related to full English forms makes it possible to
view them as instances of reduced
SUBJECT-VERB-OBJECT patterns from which particular
components have been deleted. Fragments of type I
can be represented as having a deleted tense and
verb be, of type II as having a deleted subject,
tense, and verb be, etc. This makes it relatively
straightforward to add them to the parsing grammar,
[Table 1 (fragment types I-VII): contents not recoverable]
TABLE 2. DELETION FORMS IN NORMAL ENGLISH

I.   DELETED TENSE, VERB BE
     A. N + PASSIVE PRED       THE KING HAD HIM BEHEADED.
     B. N + PROGRESSIVE PRED   WE OBSERVED BILL [illegible] TO HIMSELF.
     C. N + ADJECTIVE PRED     SHE FOUND THE SENTENCE [illegible].
     D. N + PN                 THEY FOUND HIS [illegible] OF [illegible].
     E. N + Q                  JOHN THOUGHT HIM [illegible] OR [illegible].
     F. N + N                  THEY CONSIDERED HER THEIR SAVIOUR.

II.  DELETED SUBJECT, TENSE, VERB BE [VERBAL PREDICATE]
     A. PASSIVE PREDICATE      THE MAN, [illegible] WITH HIS WORK, WENT HOME.
     B. PROGRESSIVE PREDICATE  MARY LEFT WHISTLING A HAPPY TUNE.

III. DELETED SUBJECT, TENSE, VERB BE
     A. ADJECTIVE PREDICATE    GRACIOUS AS EVER, SHE WELCOMED HER GUESTS.
     B. PN PREDICATE           THE GUARD, IN GREAT ALARM, CALLED THE POLICE.

IV.  DELETED SUBJECT, TENSE, VERB BE
     NOUN PHRASE               THE CHILD, A CLUMSY DANCER, TWISTED HER ANKLE.

V.   DELETED SUBJECT, TENSE, VERB BE
     INFINITIVAL PREDICATE     THEY TOOK THE TRAIN TO AVOID THE TRAFFIC.
and, at the same time, provides a framework for
identifying their semantic content by relating them
to the corresponding full forms.
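
The view of fragments as reduced SUBJECT-VERB-OBJECT patterns can be
sketched as a small data structure. The Python below is illustrative
only: the names SLOTS, FragmentType and FRAGMENT_TYPES are assumptions
made for this sketch, not part of the Linguistic String Project
grammar, and only the types whose deletions are stated in the prose
(I, II and IV) are encoded.

```python
from dataclasses import dataclass

# Syntactic positions named in the text as possible deletion sites.
SLOTS = ("subject", "tense", "verb", "object")


@dataclass(frozen=True)
class FragmentType:
    """A fragment type described by the positions it omits (hypothetical model)."""
    name: str
    deleted: frozenset  # positions missing relative to the full sentence form

    def surviving(self):
        """Positions still realized in the fragment, in S-V-O order."""
        return tuple(slot for slot in SLOTS if slot not in self.deleted)


# Deletions as stated in the text; other types would be added the same way.
FRAGMENT_TYPES = {
    "I": FragmentType("I", frozenset({"tense", "verb"})),
    "II": FragmentType("II", frozenset({"subject", "tense", "verb"})),
    "IV": FragmentType("IV", frozenset({"subject", "tense", "verb"})),
}

print(FRAGMENT_TYPES["I"].surviving())   # ('subject', 'object')
print(FRAGMENT_TYPES["II"].surviving())  # ('object',)
```

Describing a type by what it omits, rather than by what it contains,
keeps the link to the full form explicit: the surviving positions are
whatever the full S-V-O pattern has left after the deletions.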
The number of fragment types that occur in
compact text of different technical domains appears
to be relatively limited. When the fragment types
found in medical records were compared with those
seen in a small sample of Navy equipment status
messages, five of the six types found in the
medical records were also found in the Navy
messages. Only one additional fragment type was
required to cover the Navy messages. This type
appears in Table 1 as type VII, in which two
subjects have been deleted (Request advise
[illegible]).
While the number of fragment types is
relatively constant, the distribution of fragment
types varies according to the domain of the text.
Table 3 shows distributions for each of the
fragment types identified in Table 1. For example,
in Table 3, while fragment type IV, from which
subject, tense, and verb have been deleted, is most
frequent in medical records, it is a much less
frequent type in the Navy messages. On the other
hand, type VI, from which a subject has been
deleted, is relatively infrequent in medical
records, but much more frequent in Navy messages.
In addition, the different sections of the
input differ with respect to the ratio of fragments
to whole sentences and in the types of fragments
they contain. For example, the different sections
of the medical records that were analyzed (e.g.
HISTORY, EXAM, LAB-DATA, IMPRESSION, COURSE IN
HOSPITAL) were distinguished by differences in the
distribution of the fragment types. The EXAM
paragraph of the medical texts, in which the
physician describes the results of the patient's
physical examination, contained a relatively large
number of fragments of type III, especially
adjective phrases. The COURSE IN HOSPITAL
paragraph contained a larger number of complete
sentences than the other paragraphs.
TABLE 3. DISTRIBUTION OF FRAGMENT TYPES

TYPE    MEDICAL    NAVY
I.        22%       36%
II.        1%        6%
III.      12%       11%
IV.       61%       15%
V.         1%        0%
VI.        2%       28%
VII.       0%        4%
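
The contrast between the two domains can be read off the table
mechanically. The short Python sketch below uses the percentages from
Table 3; the names DISTRIBUTION, most_frequent and largest_shifts are
hypothetical helpers introduced for illustration.

```python
# Percentages from Table 3: type -> (medical records, Navy messages).
DISTRIBUTION = {
    "I": (22, 36),
    "II": (1, 6),
    "III": (12, 11),
    "IV": (61, 15),
    "V": (1, 0),
    "VI": (2, 28),
    "VII": (0, 4),
}

MEDICAL, NAVY = 0, 1


def most_frequent(domain):
    """Fragment type with the highest share in the given domain column."""
    return max(DISTRIBUTION, key=lambda t: DISTRIBUTION[t][domain])


def largest_shifts(n=2):
    """The n types whose share differs most between the two domains."""
    def gap(t):
        return abs(DISTRIBUTION[t][MEDICAL] - DISTRIBUTION[t][NAVY])
    return sorted(DISTRIBUTION, key=gap, reverse=True)[:n]


print(most_frequent(MEDICAL))  # IV
print(most_frequent(NAVY))     # I
print(largest_shifts())        # ['IV', 'VI']
```

The two types with the largest shift, IV and VI, are the ones singled
out in the discussion of Table 3.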
III. RECONSTRUCTION OF DELETIONS 
The deletions which relate fragment types to
their full sentence forms fall into two main
classes: (I) those found in virtually all texts and
(II) those specific to the domain of the text.
Just as the fragment types can be viewed as
incomplete realizations of syntactic S-V-O
structures, the semantic patterns in sentence
fragments can be considered incomplete realizations
of the semantic S-V-O patterns. In general terms,
the structure of information in technical domains
can be specified by a set of semantic classes, the
words and phrases which belong to these classes,
and by a specification of the patterns these
classes enter into, i.e. the syntactic
relationships among the members of the classes
[Grishman, et al., 1982; Sager, 1978]. In the case
of the medical sublanguage processed by the
Linguistic String Project, the medical subclasses
were derived through techniques of distributional
analysis [Hirschman and Sager, 1982]. Semantic
S-V-O patterns were then derived from the
combinatory properties of the medical classes in
the text [Marsh and Sager, 1982]; the semantic
patterns identified in a text are specific to the
domain of the text. While they serve to formulate
sublanguage constraints which rule out incorrect
syntactic analyses caused by structural or lexical
ambiguity, these relationships among classes can
also provide a means by which deleted elements in
compact text can be reconstructed. When a fragment
is recognized as an instance of a given semantic
pattern, it is then possible to specify a set of
the semantic classes from which the medical
sublanguage class of the deleted element can be
selected.
On a superficial level, the deletions of be in
fragment types Ic-f and IIIa-b, for example, can be
reconstructed on purely syntactic grounds by
filling in the lexical item be. However, it is
also possible to provide further information and
specify the semantic class of the lexical item be
by reference to the semantic S-V-O pattern
manifested by the occurring subject and object.
For example, in the type If (N + N) fragment skin no
eruptions, skin has the medical subclass BODYPART, and
eruptions has the medical subclass SIGN/SYMPTOM.
The semantic S-V-O pattern in which these classes
play a part is:
BODYPART-SHOWVERB-SIGN/SYMPTOM
(as in Skin showed no eruptions). Be can then be
assigned the semantic class SHOWVERB. The fragment
protein [illegible], also of type I, enters into
the semantic pattern:
TEST-TESTVERB-TESTRESULT
and be can be assigned the class TESTVERB, which
relates a TEST subject with a TESTRESULT object.
Assigning a semantic class to the reconstructed be
maximizes its informational content.
In addition to reconstructing a distinguished
lexical item, like the verb be, along with its
semantic classes, it is also possible to specify
the set of semantic classes for a deleted element,
even though a lexical item is not immediately
reconstructable. For example, the fragment To
receive folic acid, of type VI, contains a verb of
the PTVERB class and a MEDICATION object, but the
subject has been deleted. The only semantic
pattern which permits a verb and object with these
medical subclasses is the S-V-O pattern:
PATIENT-PTVERB-MEDICATION
Through recognition of the semantic pattern in
which the occurring elements of the fragment play a
role, the semantic class PATIENT can be specified
for the deleted subject. Patient is one of the
distinguished words in the domain of narrative
medical records which are often not explicitly
mentioned in the text, although they play a role in
the semantic patterns.
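
The two reconstructions described above, a semantic class for the
deleted verb be and a semantic class for a deleted subject, are both
lookups against the inventory of semantic S-V-O patterns. A minimal
Python sketch follows, assuming the inventory can be held as a list of
class triples; PATTERNS and candidate_classes are hypothetical names,
and only the three patterns quoted in the text are listed.

```python
# Semantic S-V-O patterns quoted in the text: (subject, verb, object) classes.
PATTERNS = [
    ("BODYPART", "SHOWVERB", "SIGN/SYMPTOM"),
    ("TEST", "TESTVERB", "TESTRESULT"),
    ("PATIENT", "PTVERB", "MEDICATION"),
]


def candidate_classes(subject=None, verb=None, obj=None):
    """Return the set of classes allowed in the one slot left as None.

    A pattern contributes a candidate when it agrees with every class
    that does occur in the fragment.
    """
    observed = (subject, verb, obj)
    missing = [i for i, cls in enumerate(observed) if cls is None]
    if len(missing) != 1:
        raise ValueError("exactly one slot must be left unspecified")
    slot = missing[0]
    return {
        pattern[slot]
        for pattern in PATTERNS
        if all(o is None or o == c for o, c in zip(observed, pattern))
    }


# "skin no eruptions": subject and object occur, be is deleted.
print(candidate_classes(subject="BODYPART", obj="SIGN/SYMPTOM"))  # {'SHOWVERB'}

# "To receive folic acid": verb and object occur, subject is deleted.
print(candidate_classes(verb="PTVERB", obj="MEDICATION"))  # {'PATIENT'}
```

Returning a set rather than a single class leaves room for fragments
that match more than one pattern.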
The S-V-O relations, of which the fragment
types are incomplete realizations, form the basis
of a procedure which specifies the semantic classes
of deleted elements in fragments. Under the best
conditions, the set of semantic classes for the
deleted form contains only one element. It is also
possible, however, for the set to contain more than
one semantic class. For example, the type Ia
fragment Pain also noted in hands and knees, when
regularized to normal active S-V-O word order as
noted pain in hands and knees, has a deleted
subject. The set of possible medical classes for
the deleted subject consists of {PATIENT, FAMILY,
DOCTOR}, since a fragment with a verb of the
OBSERVE class, such as note, and an object of the
SIGN/SYMPTOM class, such as pain, can enter into
SUBJECT    VERB       OBJECT
FAMILY     OBSERVE    SIGN/SYMPTOM    (MOTHER OBSERVED FEVER.)
PATIENT    OBSERVE    SIGN/SYMPTOM    (PATIENT OBSERVED FEVER.)
DOCTOR     OBSERVE    SIGN/SYMPTOM    (DOCTOR OBSERVED FEVER.)

FIGURE 1. EXAMPLES OF SUBJECT-VERB-OBJECT PATTERNS
any of the S-V-O patterns in Figure 1. The choice
of one subclass for the deleted element from among
elements of the set of possible subclasses is
dependent on several factors. First, properties of
paragraph structure of the text place restrictions
on the selection of semantic class for a deleted
element. The fragment noted pain in hands and
knees would select a DOCTOR subject if written in
the IMPRESSION or EXAM paragraph of the text, but,
in the HISTORY paragraph, a PATIENT or FAMILY
subject could not be excluded. A second factor is
the presence of an antecedent having one of the
semantic classes specified for the deleted element.
If a possible antecedent having the same semantic
class can be found, subject to restrictions on
change of topic and discourse structure, then the
deleted element can be filled in by its antecedent,
restricting the semantic class of the deleted
element to that of the antecedent. However, an
antecedent search may not always be successful,
since the antecedent may not have been explicitly
mentioned in the text. The antecedent may be one
of a class of distinguished words in the
sublanguage, such as patient and [illegible], which may
not be previously mentioned in the body of the
text.
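
The two narrowing factors just described, paragraph type and the
presence of an antecedent, can be sketched as successive filters on
the candidate set. The Python below is a simplified illustration under
stated assumptions: SECTION_SUBJECTS encodes only the IMPRESSION and
EXAM restriction mentioned above, antecedents are supplied most recent
first as (word, class) pairs, and the restrictions on change of topic
and discourse structure are not modeled. All names are hypothetical.

```python
# Candidate subject classes for an OBSERVE verb with a SIGN/SYMPTOM object.
OBSERVE_SUBJECTS = {"PATIENT", "FAMILY", "DOCTOR"}

# Paragraph types that restrict the deleted subject (assumed encoding);
# paragraphs not listed here, such as HISTORY, exclude nothing.
SECTION_SUBJECTS = {"IMPRESSION": {"DOCTOR"}, "EXAM": {"DOCTOR"}}


def narrow(candidates, section, antecedents=()):
    """Return (remaining classes, antecedent word or None).

    First apply the paragraph restriction, then take the most recent
    antecedent whose class is still among the candidates.
    """
    remaining = candidates & SECTION_SUBJECTS.get(section, candidates)
    for word, sem_class in antecedents:  # most recent first
        if sem_class in remaining:
            return {sem_class}, word
    # No usable antecedent: the deleted element stays a set of classes.
    return remaining, None


print(narrow(OBSERVE_SUBJECTS, "EXAM"))  # ({'DOCTOR'}, None)
print(narrow(OBSERVE_SUBJECTS, "HISTORY",
             [("knees", "BODYPART"), ("mother", "FAMILY")]))  # ({'FAMILY'}, 'mother')
```

When the search finds no antecedent, the function returns the
remaining classes unresolved, which corresponds to the case of
distinguished words that are never mentioned in the body of the text.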
Thus, semantic patterns derived from
distributional analysis permit the specification of
a set of semantic classes for deleted elements in
texts characterized by a large proportion of
sentence fragments. This specification can
facilitate the reconstruction of deleted elements
by limiting choice among possible antecedents.
IV. CONCLUSION 
In this paper, seven deletion patterns found
in technical compact text have been identified.
The number of fragment types is relatively limited.
Five of the seven occur in the full grammar of
English as subparts of fuller structures. These
syntactic fragment types can be viewed as
incomplete realizations of syntactic
SUBJECT-VERB-OBJECT structures; the semantic
patterns in sentence fragments are found to be
incomplete realizations of the semantic
SUBJECT-VERB-OBJECT patterns found in full
sentences. Semantic classes can be specified for
deleted elements in sentence fragments based on
these semantic patterns.
ACKNOWLEDGMENTS
This research was supported in part by
National Science Foundation grant number
IST-79-20788 from the Division of Information
Science and Technology, and in part by National
Library of Medicine grant number 1-R01-LM03953
awarded by the National Institutes of Health,
Department of Health and Human Services.

REFERENCES 

Anderson, B., Bross, I. D. J. and N. Sager (1975).
Grammatical Compression in Notes and Records.
American Journal of Computational Linguistics
2:4.

Grishman, R., Hirschman, L., and C. Friedman 
(1982). Natural Language Interfaces Using 
Limited Semantic Information. Proceedings of 
9th International Conference on Computational
Linguistics (COLING 82), Prague, 
Czechoslovakia. 

Hendrix, G. (1977). Human Engineering for Applied
Natural Language Processing. Proceedings of
5th IJCAI, Cambridge, Mass.

Hendrix, G., Sacerdoti, E., Sagalowicz, D., and J.
Slocum (1978). Developing a Natural Language
Interface to Complex Data. ACM TODS 3:2.

Hirschman, L. and N. Sager (1982). Automatic
Information Formatting of a Medical
Sublanguage. Sublanguage: Studies of Language
in Restricted Semantic Domains (R. Kittredge
and J. Lehrberger, eds.). Walter de Gruyter,
Berlin.

Kwasny, S. C. and N. K. Sondheimer (1981).
Relaxation Techniques for Parsing Ill-formed
Input. American Journal of
Computational Linguistics 7:2.

Marsh, E. and N. Sager (1982). Analysis and 
Processing of Compact Text. Proceedings of 
the 9th International Conference on 
Computational Linguistics (COLING 82), Prague, 
Czechoslovakia. 

Sager, N. (1978). Natural Language Information
Formatting: The Automatic Conversion of Texts
to a Structured Data Base. In Advances in
Computers 17 (M. C. Yovits, ed.), Academic
Press, New York.

Waltz, D. (1978). An English Language Question
Answering System for a Large Relational Data
Base. CACM 21:7.
