Closed Yesterday and Closed Minds: 
Asking the Right Questions of the Corpus 
To Distinguish Thematic from Sentential 
Relations 
Uri Zernik 
Artificial Intelligence Laboratory 
General Electric - Research and Development Center 
Abstract 
Collocation-based tagging and bracketing pro- 
graras have attained promising results. Yet, 
they have not arrived at the stage where they 
could be used as pre-procezsors for full-fledged 
parsing. Accuracy is still not high enough. 
To improve accuracy, it is necessary to inves- 
tigate the points where statistical data is being 
misinterpreted, leading to incorrect results. 
In this paper we investigate inaccuracy which 
is injected when a pre-processor relies solely on 
collocations and blurs the distinction between 
two separate relations: thematic relations and 
sentential relations. 
Thematic relations are word pairs, not nec- 
essarily adjacent, (e.g., adjourn a meeting) that 
encode information at the concept level. Sen- 
tential relations, on the other hand, concern 
adjacent word pairs that form a noun group. 
E.g., preferred stock is a noun group that must 
be identified as such at ttle syntactic level. 
Blurring the difference between these two 
phenomena contributes to errors in tagging of 
pairs such as ezpressed concerns, a verb-noun 
construct, as opposed to preferred stocks, an 
adjective-noun construct. Although both re- 
lations are manifested in the corpus as high 
mutual-information collocations, they possess 
difl'erent prot)erties and they need to be sepa- 
raled. 
In our method, we distinguish between these 
two cases by asking additional questions of the 
corpus. By definition, thematic relations take 
on filrther variations in the corpus. Expressed 
concerns (a thematic relation) takes concerns 
expressed, expressing concerns, express his con- 
cerns ere. On the other hand, preferred stock 
(a sentential relation) does not take any such 
syntactic variations. 
We show how this method impacts pre- 
processing and parsing, and we provide em- 
pirical results based on the analysis of an 80- 
million word corpus. I 2 
Pre-Processing: The Greater 
Picture 
Sentences in a typical newspaper story in- 
clude idioms, ellipses, and ungrammatic con- 
structs. Since authentic language defies text- 
book grammar, we must rethink our basic pars- 
~This research was sponsored (in part) by 
the Defense Advanced Research Project Agency 
(DOD) and other government agencies. The views 
and conclusions contained ill this document are 
those of the authors and should not be inter- 
preted as representing the official policies, either 
expressed or implied, of the Defense Advanced Re- 
search Project Agency or the US Government. 
2We thank ACL/DCI (Data Collection Initia- 
tive), the Collins publishing company, and the 
Wall Street Journal, for providing invaluable on- 
line data. 
ACTES DE" COL1NG-92. NANTES, 23-28 AOUr 1992 1 3 0 5 PROC. OF COLING-92. NANTES, AUG. 23-28. 1992 
\[Separately/av\] *comma*/ec \[Kaneb/nm Serviccs/nn\] \[said/vb\] \[holders/nn\] \[of/pp its/dt 
Class/nn h/aj preferred/aj stock/nn\] *comma*/cc \[failed/vb\] \[to/pp elect/vb\] \[two/aj di- 
rectors/nn\] \[to/pp the/dt company/nn board/un\] when/co \[the/dt annual/aj meeting/nn\] \[re- 
sumed/vb\] \[Tuesday/aj\] becanse/cc there/cc are/ax \[questions/nn\] as/cc \[to/pp the/dr valid- 
ity/nn\] \[of/pp the/dt proxies/nn\] \[submitted/vb\] \[for/pp review/nn\] \[by/pp the/dt group/nn\] 
*period*/cc 
\[The/dt ..... pany/nn\] \[adjourned/vb\] \[its/pn annual/aj meeting/nn\] Uay/nm 12/aj\] \[to/pp 
allow/vb\] \[time/nn\] \[for/pp negotiations/nn\] and/cc \[expressed/vb\] \[concern/nn\] \[about/pp 
future/aj actions/nn\] \[by/pp preferred/vb holders/nn\] *period*/cc 
Figure 1: Pre-processed Text Produced by NLcp 
ing paradigm and tune it to the nature of the 
text under analysis. 
Hypothetically, parsing could be performed 
by one huge unification mechanism \[Kay, 1985; 
Shieber, 1986; Tomita, 1986\] which would pro- 
cess sentences at any level of complexity. Such 
a mechanism would recieve its tokens in the 
form of words, characters, or morphemes, ne- 
gotiate all given constraints, and produce a full 
chart with all possible interpretations. 
However, when tested on a real corpus, (i.e., 
Wall Street Journal (WSJ) news stories), this 
mechanism fares poorly. For one thing, a typ- 
ical well-behaved 34-word sentence produces 
hundreds of candidate interpretations. In ef- 
fect the parsing burden is passed onto a post 
processor whose task is to select the appropri- 
ate parse tree within the entire forest. 
For another, ill-behaved sentences - roughly 
one out of three WSJ sentences is problematic 
- yield no consistent interpretation whatsoever 
due to parsing failures. 
To alleviate problems associated with rough 
edges in real text, a new strategy has emerged, 
involving text pre-processing. A pre-processor, 
capitalizing on statistical data \[Church el aL, 
1989; Zernik and Jacobs, 1990; Dagan et al., 
1991\], and customized to the corpus itself, 
could abstract idiosyncracies, highlight regu- 
larities, and, in general, feed digested text into 
the unification parser. 
What is Pre-Processing Up 
Against? 
The Linguistic Phenomenon 
Consider (Figure 1) a WSJ (August 19, 1987) 
paragraph processed by NLpc (NL corpus pro~ 
eessing) \[Zernik el aL, 1991J. Two types of lin- 
guistic constructs must be resolved by the pre- 
processor: 
Class A preferred/AJ stock/NN 
*comma* 
and expressed/VB eoneern/NN about 
How can a program determine that preferred 
stock is an adjective-noun, while expressed con- 
cern is a verb-aoun construct? 
The Input 
The scope of the pre-processing task is best 
illustrated by the input to the prc-processor 
shown in Figure 2. 
This lexical analysis of the sentence is based 
on the Collins on-line dictionary (about 49,000 
lexical entries extracted by NLpe) plus mor- 
phology. Each word is associated with candi- 
dales part of speech, and almost all words are 
ambiguous. The tagger's task is to resolve the 
ambiguity. 
For example, ambiguous words such as ser- 
vices, preferred, and expressed, should be 
tagged as noun (nn), adjective (aj), and verb 
(vb), respectively. While some pairs (e.g., an- 
nual meeting) can be resolved easily, other pairs 
At'l-ms DE COLING-92, NANTES, 23-28 AO~-r 1992 1 3 0 6 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
Separately AV 
said AJ VB 
its DT 
preferred AJ VB 
to PP 
directors NN 
company NN 
annual AJ 
tuesday NM 
proxies NN 
Kaneb NM Services NN VB 
holders NN of PP 
Class AJ NN A DT AJ 
stock NN VB failed AI) VB 
elect VB two AJ NN 
to PP the DT 
hoard NN VB when CC 
meeting NN VB resumed AJ VB 
questions NN VB validity NN 
submittedAJ VB group NN VB 
Figure 2: Lexical Analysis of Sentence: Words plus Part of Speech 
(e.g., preferred stock and e~:pressed concerns) 
are more difficult, and require statistical train- 
ing. 
Part-Of-Speech Resolution 
The program can bring to bear 3 types of clues: 
Local context: Consider the following 2 cases 
where local context donfinates: 
1. the preferred stock raised 
2. he expressed concern about 
The words the and he dictate that preferred 
and expressed are adjective and verb respec-- 
tively. This kind of inference, due to its local 
nature, is captured and propagated by tile 
pre-processor. 
Global context: Global-sentence constraints 
arc shown by the following two examples: 
1. and preferred stock sold yesterday 
1Nns ... 
2. and expressed concern abouL 
•.. *period* 
In case 1, a main verb is found (i.e., was), and 
preferred is taken as art adjective; in case 2, 
a main verb is not found, and therefore ez- 
pressed itself is taken as the main verb. This 
kind of mnbiguity requires fidl-fledged uni- 
fication, and it is not bandied by the pre- 
processor. Fortunately, only a small percent 
of the cases (in newspaper stories) depend on 
global reading. 
Corpus-based prefereltce: Corpus analysis 
(WSJ, 80-million words) provides word- 
association preference \[Beckwith el at., 1991\] 
collocation total vb-nn aj-nn 
preferred stock 2314 100 O 
expressed concern 318 1 99 
The construct expressed concern, which ap- 
pears 318 times in the corpus, is 99% a verb- 
noun construct; on tile other hand, preferred 
stock, which appears in the corpus 2314 
times, is 99% an adjective-norm construct. 3 
Where Is The Evidence? 
The last item, however, is not directly avail- 
able. Since the corpus is not a-priori tagged, 
there is no direct eviderLcc regarding part-of- 
speech. All we get from the corpus are num- 
bers that indicate the mutual information score 
(MIS) \[Church el al., 1991\] of collocations (9.9 
and 8.7, tbr preferred stock and expressed con- 
cern, respectively). It becomes necessary to in- 
fer the nature of the combination from indirect 
corpus~based statistics as shown by the rest of 
this paper. 
3For expository psrposes we chose here two ex- 
treme, clear cut cases; other pairs (e.g., promised 
money) are not totally biased towards one side or 
another. 
ACIES DE COLING-92, NANTES, 23-28 AOt\]T 1992 1 3 0 7 PRO(:. O1: COLING-92, NANTES, AUG. 23-28, 1992 
Inferring Syntax from 
Collocations 
In this section we describe the method used for 
eliciting word-association preference from the 
corpus. 
Initial Observation: Co-occurrence 
Entails Sentential Relations 
The bazic intuition used invariably by all ex- 
isting statistical taggers is stated as follows: 
Significant collocations (i.e., high MIS) predict 
syntactic word association. Since, for example, 
preferred stock is a significant collocation (mis 
9.9), with all other clues assumed neutral, it 
will be marked as an integral noun group in 
the sentence. 
However, is high mis always a good predic- 
tor? Figure 3 provides mutual information 
scores for preferred, expressed, and closed right 
collocations. 
The first column (preferred) suggests mis is 
a perfect predictor. A count in the corpus con- 
firms that a predictor based on collocations is 
always correct. A small sample of preferred col- 
locations in context is given Figure 4. Notice 
that in all eases, preferred is an adjective. 
Next Observation: Co-occurrence 
Entails Thematic Relations 
While column 1 (preferred) yields good syntac- 
tic associations, column 2 (ezpressed) and col- 
umn 3 (closed) yield different conclusions. It 
turns out (see Figure 4) that expressed colloca- 
tions, even collocations with high mis, produce 
a bias towards false-positive groupings. 4 
If these collocation do not signify word 
groupings, what do they signify? An obser- 
vation of expressed right collocates reveals that 
the words surprise, confidence, skepticism, op- 
timism, disappointment, support, hope, doubt, 
4Word associations based on corpus do not dic- 
tate the nature of word groupings; they merely pro- 
vide a predictor that is accounted for with other 
locaJ-context clues. 
worry, salisfaclion, etc., are all thematic rela- 
tions of express. 
Namely, a pair such as expressed disappoint- 
ment denotes an action-object relation which 
could come in many variants. The last part of 
Figure 4 shows various combinations of express 
and its collocates. 
Using Additional Evidence 
In light of this observation, it is necessary to 
test in the corpus whether collocations are fixed 
or variable. For a collocation wordl-word2, if 
wordl and word2 combine in multiple ways, 
then wordl-word2 is taken as a thematic re- 
lation; otherwise it is taken as a fixed noun 
group. 
This test for ezpress~word is shown in Figure 
5. Each row provides the number of times each 
variant is found. Variants for expressed con- 
cerns, for example, are concerti expressed, ex- 
press concern, ezpresses concern, and express. 
ing concern. Not shown here is the count for 
split co-occurrence \[Smadja, 1991\], i.e., express 
its concern, concern was expressed. The last 
column sums up the result as a ratio (variabil- 
ity ratio) against the original collocation. 
In conclusion, for 12 out of 15 of the checked 
collocations we found a reasonable degree of 
variability. 
Making Statistics Operational 
While the analysis in Figure 5 provides the mo- 
tivation for using additional evidence, we have 
two steps to take to make this evidence useful 
within an operational tagger. 
Dealing with Small Numbers 
Although the table in Figure 5 is adequate 
for expository purposes, in practice the differ- 
ent collected figures are spread over too many 
rubrics, making the numbers susceptible to 
noise. 
To avoid this problem we short-cut the calcu- 
lation above and collect all the co-occurrence of 
AcrEs DE COLING-92, NANTES, 23-28 AO~r 1992 I 3 0 8 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
9.9 preferred stock 11.9 expressed disappointment 20.4 cloud friday 
9.8 preferred dividend 11.6 expressed skepticism 17.4 closed monday 
8.1 preferred share 10.8 expressed optimism 16.3 closed tuesday 
7.4 preferred method 10.8 expressed reservations 16.0 closed thursday 
7.4 preferred holders 10.1 expressed doubt 16.0 closed today 
7.0 preferred stockholders 10.0 expressed surprise 15.7 closed wednesday 
7.0 preferred shareholders 10.O expressed satisfaction 15.5 closed saturday 
6.1 preferred issue 9.6 expressed confidence 13.8 closed tomorrow 
5.2 preferred units 8.9 expressed shock 13.8 closed mouthed 
5.0 preferred series 8.8 expressed hope 8.1 closed minded 
4.7 preferred equity 8.7 expressed concern 8.0 closed caption 
4.6 preferred closed 8.7 expressed worry 7.7 closed milieu 
4.5 preferred customer 8.6 expressed relief 7.5 closed doors 
4.1 preferred course 8.2 expressed interest 7.4 closed yesterday 
3.7 preferred product 7.0 expressed supt>ort 6.8 closed dumps 
Figure 3: Right-Collocations for |'referred, Expressed, and Closed 
the roots of the words under arralysis. In- 
stead of asking: "what are the individual varL 
ants?" we ask "what is the total co-occurrence 
of the root pair?". For expressed concerns we 
check the incidence of czpress-in~eresl (and of 
interest-express). 
As a result, we get the lump sum without 
summing up the individual numbers. 
Incorporating Statistics in Tagging 
Co-oecurence information regarding each pair 
of words is integrated, as described in Section 
2.3, with other local-context clues. Titus, the 
fact that statistics provide a strong preference 
can always be overidden by other factors. 
they preferred stock ... 
the expressed interest by shareholders was 
In both these cases the final call is dictated by 
syntactic markers in spite of strong statistical 
preference. 
Conclusions 
NLpc processes collocations by their category. 
In this paper, we investigated specifically the 
PastParticiple-Noun category (e.g., preferred- 
stock, expressed-concerns, etc.). Other cate- 
gories (in particular ContinuousVerb-Noun as 
ill driving cars vs. operating systems) are pro- 
cessed in a similar way, using slightly different 
evidence and thresholds. 
The Figures 
qbtal cases: 2031 
Applicable cases: 400 
Insufficient data: 23 
Incorrect tagging: 19 
Correct tagging: 3,58 
Evaluation 
Out of 2031 tagging cases counted, the algo- 
rithm was called in 400 eemes. 1631 cases were 
not called since they did not involve colloca- 
tions (or involved trivial collocations such as 
ezpressed some fears.) Out of 400 collocations 
the program avoided ruling in 23 cases due to 
insufficient data. Within the 377 tagged cases, 
358 (94.9%) cases were correct, and 19 were 
incorrect. 
90% Accuracy is Not Enough 
Existing pre-processors \[Church et al., 1989; 
Zernik et al., 1991\] which have used corpus- 
based collocations, have attained levels of ac- 
ACRES BE COLING-92, NANTES, 23-28 Ao~r 1992 1 3 0 9 I'~toc. OF COLING-92, NANTES, Aut~. 23-28, 1992 
GE for the 585,000 shares of its 
ume payments of dividends on the 
oha,k but lowered ratings on its 
n* 3 from BAA *hyphen* 2 *comma* 
llar* 26,65 a share *period* The 
axes of common for each share of 
0 *pc* of Vaxity *ap* common and 
ng of up to *dollar* 250 million 
erms of the transaction call for 
sal *comma* to swap one share of 
i *dollar* 2 million annually in 
p* notes and 7,459 Lori series C 
a share of nevly issued series A 
ante an adjustable *hyphen* rate 
id he told the house Mr. Dingell 
ggested that the U.S. Mr. Harper 
ne tax *period* Some legislators 
soybeans and feed grains *comma* 
bid *dash* *dash* *dash* GE unit 
hallenge *period* Mr. Wright has 
bt about their bank one also had 
italy *ap* President Cossiga and 
*comma* sayin 6 earner executives 
secretary Robert Mosbacher have 
thor on the nature paper *comla* 
eber who *comma* he said *comma* 
ring gold in the street and then 
said that Mational Pizza Co. has 
r. nixes *comma* Chinese leaders 
e Bay Area *ap* pastry community 
presidents also are expected to 
its predecessor *period* It also 
related Services Co. people she 
c chairman Seidman *comma* vhile 
* on a tour of asia *comma* also 
ponsored the senate plan *comma* 
the nine supreme court justices 
nd primerica in his eagerness to 
st few eeeks alone *dash* *dash* 
iterally flipped his wig *comma* 
that the neuspaper company said 
she no longer feel they have to 
icans mriting to the hostages to 
en stmmoned to chairman Gonzalez 
tied* Frequently *comma* clients 
preferred 
preferred 
preferred 
preferred 
preferred 
preferred 
preferred 
prefelTed 
preferred 
preferred 
preferred 
preferred 
preferred 
preferred 
stock outstanding *period* The e 
stock in January *period* It sue 
stock and commercial paper *comm 
stock to ba *hyphen* 2 from BAA 
is convertible until 5 P.M.. EDT 
*r-paten* *period* Cash sill be 
shares outstanding *period* The 
shares *period* Tells of the tra 
holders *comma* she previously a 
stock for 1.2 Abates of common n 
dividends *period* Aftra owns 68 
shares ~ith a carrying value of 
stock with a value equal to *dol 
stock ~hose auction failed recen 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
expressed 
concern *comma, sources said *co 
confidence that he and Mr. Baum 
concern that a gas *hyphen* tax 
outrage over the case *comma* sa 
interest in financing offer for 
dismay that a foreign company co 
interest in Mcorp *ap* mvestment 
concern about an Italian firm su 
surprise at Sony *ap* move hut d 
concern about the EC *ap* use of 
disappointment that he vas not i 
support for the idea *period* Ca 
expressing surprise when thieves walk by t 
expressed renewed interest in acquiring th 
expressed no regret for the killings *comm 
express disbelief that Ms. Shere kept on 
express support for the Andean nations w 
expressed its commitment to a free *hyphe 
express interest in the certificates rec 
express~xg concerns *comma* also said the 
expressed a desire to visit China *period 
expressed some confidence that his plan v 
expressed varying degrees of dissatisfact 
express his linguistic doubts to America 
expressing their relief after crossing in 
expressing delight at having an eXCUSe to 
expresses confidence in the outcome of a 
express their zeal on the streets *comma 
express their grief and support *period* 
expresses sympathy for Sen. Riegle ,comma 
express interest in paintings but do *tot 
Figure 4: PREFERRED, EXPRESSED, and (root) EXPRESS collocations m context 
Acr~ DE COLING-92, NAhTES, 23-28 AObq" 1992 1 3 1 0 PROC. OF COLING-92. NANTES, AUO. 23-28, 1992 
curacy a-s high as 90%. A simple calculation 
reveals that a 34-word sentence might contain 
some 1-.2 errors on the average. 
This error rate is too high. Since the pre- 
processor's job is to eliufinate from consider- 
ation possible parse trees, if the appropriate 
parse is eliminated by the pre-processor at the 
outset, it will never be recovered by the parser. 
As shown in this paper, it is now necessary to 
investigate in depth how various linguistic phe- 
nomena are reflected by statistical data. 
x e'~__a, x 
nl|~ 
disappointment 11.9 
skepticism 11.6 57 
optimism 10.8 49 
reservations 10,8 33 
doubt 10.1 6a 
surprise 10.0 69 
satisfaction 10.0 14 
ctmfidence 0.6 67 
shock 8.9 12 
hope 8.8 46 
concern 8.7 318 
worry 8.7 13 
relief 8.6 23 
interest 8.2 224 
support _ -. 7.0 . 
X e'sed e's X I e'ses X &sing X v. ratio 
no no I no no nl n2 r 
2 1 5 6 14 89 .16 
l "2 3 57 .05 
3 1 4 8 49 .16 
3 2 1 6 33 .18 
2 1 5 4 13 63 .20 
1 5 '2 1 9 69 .13 
1 2 3 14 .21 
1 4 ! 1 6 67 .09 
3 1 4 12 .33 
2 1 4 7 46 .15 
30 31 9 25 95 318 .30 
i t 6 3 '2 12 13 .92 
0 23 .00 
4 6 9 l I :10 294 .10 
1 5 3 9 46 .20 
Figure 5:5 Variant Collocations for Express 
References 
R. Beckwith, C. Fellbaum, D. Gross, and 
G. Miller. Wordnet: A lexical database or- 
ganize(\[ on psycholinguistic principles. In 
U. Zernik, editor, Lezical Acquisilion: Ez- 
plotting On-Line Dictiona)T to Build a Lexi- 
con. Lawrence Erlbaum Assoc., Hissdale, N J, 
1991. 
K. Church, W. Gale, P. Banks, and D. tlin- 
die. Parsing, word associations, and predicate- 
argument relations. In Proceedings of the In- 
ternational Workshop on Parsing Technolo- 
gies, Carnegie Mellon University, 1989. 
K. Church, W. Gale, P. Banks, and D. tlin- 
die. Using statistics in lexical analysis. In 
U. Zernik, editor, Lexical Acquisition: Us- 
ing On-Line Resou)~:es to Build a Lexicon. 
Lawrence Erlbanm Associates, Hillsdale, N J, 
1991. 
1. Dagan, A. ltai, and U. Schwall. Two lan- 
guages are more informative than one. In 
Proceedings of the egth Annual Meeting of 
the Association for Computational Linguzs- 
tics, Berkeley, CA, 1991. 
M. Kay. Parsing in Functional Unification 
Gralomax. In D. Dowty, L. Kartunnen, and 
A. Zwicky, editors, Natural Language Parsing: 
Psychological, Computational, and TheonetZ- 
col Perspectives. Cambridge University Press, 
Cambridge, England, 1985. 
S. Silieber. An Introduction to Unification- 
based Approaches to Grammar. Center for 
the Study of Language and Information, Polo 
Alto, California, 1986. 
F. Smadja. Macrocoding the lexicon with co. 
occurrence knowledge. In U. Zernik, editor, 
Lezical Acquisition: Using On-Line Resources 
to Build a Lexicon. Lawrence Erlbaum Asso- 
ciates, Hillsdale, N J, 1991. 
M. q_bmita. Efficient Parsino for Natural Lan- 
guage. Kluwer Academic Publishers, Hing- 
ham, Massachusetts, 1986. 
U. Zernik aJld P. Jacobs. Tagging for learning. 
In COLING 1990, ltelsinki, Finland, 1990. 
U. Zernik, A. Dietsch, and M. Charbonneau. 
hntoolset programmer's manual. Ge-crd tech- 
nical report, Artificial Intelligence Labora- 
tory, Schenectady, NY, 1991. 
A~ DE COLING-92, NANTES, 23-28 Aotrr 1992 1 3 1 I Prtoc. OF COL1NGO2, NANTES, AUG. 23-28, 1992 
