Independence Assumptions Considered Harmful 
Alexander Franz 
Sony Computer Science Laboratory & D21 Laboratory
Sony Corporation 
6-7-35 Kitashinagawa 
Shinagawa-ku, Tokyo 141, Japan 
amI@csl.sony.co.jp
Abstract 
Many current approaches to statistical language modeling rely on
independence assumptions between the different explanatory variables.
This results in models which are computationally simple, but which
only model the main effects of the explanatory variables on the
response variable. This paper presents an argument in favor of a
statistical approach that also models the interactions between the
explanatory variables. The argument rests on empirical evidence from
two series of experiments concerning automatic ambiguity resolution.
1 Introduction 
In this paper, we present an empirical argument in 
favor of a certain approach to statistical natural lan- 
guage modeling: we advocate statistical natural lan- 
guage models that account for the interactions be- 
tween the explanatory statistical variables, rather 
than relying on independence assumptions. Such
models are able to perform prediction on the basis of 
estimated probability distributions that are properly 
conditioned on the combinations of the individual 
values of the explanatory variables. 
After describing one type of statistical model that is particularly
well-suited to modeling natural language data, called a loglinear
model, we present empirical evidence from a series of experiments on
different ambiguity resolution tasks. The experiments show that the
loglinear models outperform other models described in the literature
that assume independence between the explanatory variables.
2 Statistical Language Modeling 
By "statistical language model", we refer to a mathematical object
that "imitates the properties" of some aspects of natural language,
and in turn makes predictions that are useful from a scientific or
engineering point of view. Much recent work in this framework has
used written and spoken natural language data to estimate parameters
for statistical models that were characterized by serious limitations:
models were either limited to a single explanatory variable or, if
more than one explanatory variable was considered, the variables were
assumed to be independent. In this section, we describe a method for
statistical language modeling that transcends these limitations.
2.1 Categorical Data Analysis 
Categorical data analysis is the area of statistics that addresses
categorical statistical variables: variables whose values are one of
a set of categories. An example of such a linguistic variable is
PART-OF-SPEECH, whose possible values might include noun, verb,
determiner, preposition, etc.
We distinguish between a set of explanatory variables, and one
response variable. A statistical model can be used to perform
prediction in the following manner: Given the values of the
explanatory variables, what is the probability distribution for the
response variable, i.e., what are the probabilities for the different
possible values of the response variable?
2.2 The Contingency Table 
The basic tool used in categorical data analysis is the contingency
table (sometimes called the "cross-classified table of counts"). A
contingency table is a matrix with one dimension for each variable,
including the response variable. Each cell in the contingency table
records the frequency of data with the appropriate characteristics.
Since each cell concerns a specific combination of features, this
provides a way to estimate probabilities of specific feature
combinations from the observed frequencies, as the cell counts can
easily be converted to probabilities. Prediction is achieved by
determining the value of the response variable given the values of
the explanatory variables.
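
As a minimal sketch of this idea, the following Python fragment cross-classifies a handful of feature vectors and converts the cell counts into a conditional distribution over the response variable. The toy data, the choice of features, and the function names `contingency_table` and `conditional_distribution` are illustrative assumptions, not part of the experiments reported here.

```python
from collections import Counter

def contingency_table(rows):
    """Count how often each full combination of variable values occurs."""
    return Counter(tuple(row) for row in rows)

def conditional_distribution(table, explanatory, responses):
    """Estimate P(response | explanatory values) from raw cell counts."""
    counts = {r: table.get(explanatory + (r,), 0) for r in responses}
    total = sum(counts.values())
    if total == 0:
        return None  # unseen combination: raw counts give no estimate
    return {r: c / total for r, c in counts.items()}

# Hypothetical toy data: (CAPITALIZED, INCLUDES-HYPHEN, POS)
rows = [
    ("yes", "no", "proper-noun"),
    ("yes", "no", "proper-noun"),
    ("yes", "no", "noun"),
    ("no", "yes", "adjective"),
    ("no", "no", "noun"),
    ("no", "no", "verb"),
]
responses = ["noun", "verb", "adjective", "proper-noun"]
table = contingency_table(rows)
dist = conditional_distribution(table, ("yes", "no"), responses)
unseen = conditional_distribution(table, ("yes", "yes"), responses)
```

Note that the combination that never occurs in the toy data yields no estimate at all from raw counts, which is the sparse data problem that smoothing is meant to address.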
2.3 The Loglinear Model 
A loglinear model is a statistical model of the effect of a set of
categorical variables and their combinations on the cell counts in a
contingency table. It can be used to address the problem of sparse
data, since it can act as a "smoothing device, used to obtain cell
estimates for every cell in a sparse array, even if the observed
count is zero" (Bishop, Fienberg, and Holland, 1975).
Marginal totals (sums for all values of some vari- 
ables) of the observed counts are used to estimate 
the parameters of the loglinear model; the model in 
turn delivers estimated expected cell counts, which 
are smoother than the original cell counts. 
The mathematical form of a loglinear model is as follows. Let
$m_{ijk\ldots}$ be the expected cell count for cell $(i, j, k, \ldots)$
in the contingency table. The general form of a loglinear model is:

$$\log m_{ijk\ldots} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + \ldots \tag{1}$$
In this formula, $u$ denotes the mean of the logarithms of all the
expected counts, $u + u_{1(i)}$ denotes the mean of the logarithms of
the expected counts with value $i$ of the first variable,
$u + u_{2(j)}$ denotes the mean of the logarithms of the expected
counts with value $j$ of the second variable,
$u + u_{1(i)} + u_{2(j)} + u_{12(ij)}$ denotes the mean of the
logarithms of the expected counts with value $i$ of the first variable
and value $j$ of the second variable, and so on.
Thus, the term $u_{1(i)}$ denotes the deviation of the mean of the
logarithms of the expected cell counts with value $i$ of the first
variable from the grand mean $u$. Similarly, the term $u_{12(ij)}$
denotes the deviation of the mean of the logarithms of the expected
cell counts with value $i$ of the first variable and value $j$ of the
second variable from the value $u + u_{1(i)} + u_{2(j)}$ given by the
grand mean and the two main effects. In other words, $u_{12(ij)}$
represents the combined effect of the values $i$ and $j$ for the first
and second variables on the logarithms of the expected cell counts,
over and above their individual main effects.
In this way, a loglinear model provides a way to estimate expected
cell counts that depend not only on the main effects of the variables,
but also on the interactions between variables. This is achieved by
adding "interaction terms" such as $u_{12(ij)}$ to the model. For
further details, see (Fienberg, 1980).
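
As an illustration, the u-terms of equation (1) can be computed directly for a small two-way table. The sketch below assumes the standard definitions for a saturated two-way model, in which the interaction term is what remains of a log count after the grand mean and both main effects have been subtracted. The function name `u_terms` and the toy counts are hypothetical.

```python
import math

def u_terms(m):
    """Return (u, u1, u2, u12) for a two-way table of positive counts m[i][j]."""
    I, J = len(m), len(m[0])
    logs = [[math.log(x) for x in row] for row in m]
    # Grand mean of the log counts.
    u = sum(sum(row) for row in logs) / (I * J)
    # Main effects: row and column means of the log counts, minus the grand mean.
    u1 = [sum(logs[i]) / J - u for i in range(I)]
    u2 = [sum(logs[i][j] for i in range(I)) / I - u for j in range(J)]
    # Interaction: what is left after the grand mean and both main effects.
    u12 = [[logs[i][j] - u - u1[i] - u2[j] for j in range(J)] for i in range(I)]
    return u, u1, u2, u12

# Counts that factor into row and column effects: no interaction.
_, _, _, no_interaction = u_terms([[2.0, 4.0], [6.0, 12.0]])

# Counts concentrated on the diagonal: no main effects, strong interaction.
u, u1, u2, strong_interaction = u_terms([[10.0, 1.0], [1.0, 10.0]])
```

In the first table every interaction term is zero up to rounding, so a model with main effects only would describe it fully. In the second table both main effects vanish and all of the structure sits in the interaction terms, which a main-effects model cannot represent.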
2.4 The Iterative Estimation Procedure 
For some loglinear models, it is possible to obtain closed forms for
the expected cell counts. For more complicated models, the iterative
proportional fitting algorithm for hierarchical loglinear models
(Deming and Stephan, 1940) can be used. Briefly, this procedure works
as follows.
Let the values for the expected cell counts that are estimated by the
model be represented by the symbol $\hat{m}_{ijk\ldots}$. The
interaction terms in the loglinear models represent constraints on the
estimated expected marginal totals. Each of these marginal constraints
translates into an adjustment scaling factor for the cell entries. The
iterative procedure has the following steps:
1. Start with initial estimates for the estimated expected cell
counts. For example, set all $\hat{m}_{ijk\ldots} = 1.0$.
2. Adjust each cell entry by multiplying it by the scaling factors.
This moves the cell entries towards satisfaction of the marginal
constraints specified by the model.
3. Iterate through the adjustment steps until the maximum difference
$\epsilon$ between the marginal totals observed in the sample and the
estimated marginal totals falls below a certain threshold, e.g.
$\epsilon = 0.1$.
After each cycle, the estimates satisfy the con- 
straints specified in the model, and the estimated 
expected marginal totals come closer to matching 
the observed totals. Thus, the process converges.
This results in Maximum Likelihood estimates for 
both multinomial and independent Poisson sampling 
schemes (Agresti, 1990). 
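
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the table is stored as a dictionary from cell index tuples to counts, each marginal constraint is given as a tuple of axis numbers, and the names `marginal`, `ipf`, `eps`, and `max_cycles` are hypothetical.

```python
from itertools import product

def marginal(table, axes):
    """Sum a dict-of-cells table down to the given axes."""
    out = {}
    for cell, count in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + count
    return out

def ipf(observed, shape, margins, eps=0.1, max_cycles=100):
    """Fit expected counts whose listed margins match the observed ones."""
    # Step 1: start with all estimated expected counts set to 1.0.
    fitted = {cell: 1.0 for cell in product(*(range(n) for n in shape))}
    targets = {axes: marginal(observed, axes) for axes in margins}
    for _ in range(max_cycles):
        # Step 2: one adjustment per marginal constraint.
        for axes in margins:
            current = marginal(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                target = targets[axes].get(key, 0.0)
                # Scaling factor: observed margin / currently fitted margin.
                fitted[cell] *= target / current[key] if current[key] > 0 else 0.0
        # Step 3: stop when every fitted margin is within eps of its target.
        worst = max(
            abs(targets[axes].get(key, 0.0) - value)
            for axes in margins
            for key, value in marginal(fitted, axes).items()
        )
        if worst < eps:
            break
    return fitted

# Independence model for a 2x2 table: constrain each one-way margin.
two_way = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}
fit2 = ipf(two_way, (2, 2), [(0,), (1,)])

# All two-way interactions for a 2x2x2 table that has one empty cell.
three_way = {(0, 0, 0): 10, (0, 0, 1): 4, (0, 1, 0): 3, (0, 1, 1): 0,
             (1, 0, 0): 2, (1, 0, 1): 6, (1, 1, 0): 5, (1, 1, 1): 8}
pairs = [(0, 1), (0, 2), (1, 2)]
fit3 = ipf(three_way, (2, 2, 2), pairs, eps=0.1, max_cycles=1000)
```

For the independence model the fitted counts coincide with the closed form (row total times column total, divided by the grand total). In the three-way example the cell with an observed count of zero receives a positive fitted count, which is the smoothing effect described in Section 2.3.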
2.5 Modeling Interactions 
For natural language classification and prediction tasks, the aim is
to estimate a conditional probability distribution $P(H \mid E)$ over
the possible values of the hypothesis $H$, where the evidence $E$
consists of a number of linguistic features $e_1, e_2, \ldots$. Much
of the previous work in this area assumes independence between the
linguistic features:

$$P(H \mid e_i, e_j, \ldots) \approx P(H \mid e_i) \times P(H \mid e_j) \times \ldots \tag{2}$$
For example, a model to predict Part-of-Speech of a word on the basis
of its morphological affix and its capitalization might assume
independence between the two explanatory variables as follows:

$$P(\mathrm{POS} \mid \mathrm{AFFIX}, \mathrm{CAPITALIZATION}) \approx P(\mathrm{POS} \mid \mathrm{AFFIX}) \times P(\mathrm{POS} \mid \mathrm{CAPITALIZATION}) \tag{3}$$
This results in a considerable computational simplification of the
model but, as we shall see below, leads to a considerable loss of
information and concomitant decrease in prediction accuracy. With a
loglinear model, on the other hand, such independence assumptions are
not necessary. The loglinear model provides a posterior distribution
that is properly conditioned on the evidence, and maximizing the
conditional probability $P(H \mid E)$ leads to minimum error rate
classification (Duda and Hart, 1973).
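
The loss of information can be made concrete with a deliberately extreme toy example, sketched below in Python. The data are hypothetical and constructed so that the hypothesis depends only on the combination of the two features; the function names are likewise illustrative.

```python
# Hypothetical observations (e1, e2, H): H depends on the *combination*
# of e1 and e2, not on either feature alone.
data = (
    [("a", "x", "N")] * 40 + [("a", "y", "V")] * 40
    + [("b", "x", "V")] * 40 + [("b", "y", "N")] * 40
)

def p_h_given(data, h, **fixed):
    """Relative-frequency estimate of P(H = h | fixed features)."""
    index = {"e1": 0, "e2": 1}
    rows = [r for r in data if all(r[index[k]] == v for k, v in fixed.items())]
    return sum(1 for r in rows if r[2] == h) / len(rows)

def independent_score(data, h, e1, e2):
    """Right-hand side of equation (2): P(H | e1) * P(H | e2)."""
    return p_h_given(data, h, e1=e1) * p_h_given(data, h, e2=e2)

def joint_estimate(data, h, e1, e2):
    """Estimate conditioned on the combination of both features."""
    return p_h_given(data, h, e1=e1, e2=e2)

score_n = independent_score(data, "N", "a", "x")
score_v = independent_score(data, "V", "a", "x")
joint_n = joint_estimate(data, "N", "a", "x")
```

Under the independence assumption both hypotheses receive the same score, so the features appear uninformative, whereas the estimate conditioned on the feature combination identifies the hypothesis with certainty. Real linguistic data are far less extreme, but the same mechanism is at work whenever features interact.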
3 Predicting Part-of-Speech 
We will now turn to the empirical evidence supporting the argument
against independence assumptions. In this section, we will compare
two models for predicting the Part-of-Speech of an unknown word: A
simple model that treats the various explanatory variables as
independent, and a model using loglinear smoothing of a contingency
table that takes into account the interactions between the
explanatory variables.
3.1 Constructing the Model 
The model was constructed in the following way. First, features that
could be used to guess the POS of a word were determined by examining
the training portion of a text corpus. The initial set of features
consisted of the following:
• INCLUDES-NUMBER. Does the word include a number?
• CAPITALIZED. Is the word in sentence-initial position and
capitalized, in any other position and capitalized, or in lower case?
• INCLUDES-PERIOD. Does the word include a period?
• INCLUDES-COMMA. Does the word include a comma?
• FINAL-PERIOD. Is the last character of the word 
a period? 
• INCLUDES-HYPHEN. Does the word include a 
hyphen? 
• ALL-UPPER-CASE. Is the word in all upper case? 
• SHORT. Is the length of the word three charac- 
ters or less? 
• INFLECTION. Does the word carry one of the 
English inflectional suffixes? 
• PREFIX. Does the word carry one of a list of 
frequently occurring prefixes? 
• SUFFIX. Does the word carry one of a list of 
frequently occurring suffixes? 
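
Most of these features are simple surface tests on the word string, as the following Python sketch illustrates. It covers only the features that can be computed without extra resources; INFLECTION, PREFIX, and SUFFIX are omitted because the affix lists are not given here. The function name `word_features` and the value labels for CAPITALIZED are assumptions made for illustration.

```python
def word_features(word, sentence_initial):
    """Map a word to the surface features that need no affix lists."""
    if word[:1].isupper():
        capitalized = "sentence-initial" if sentence_initial else "other-position"
    else:
        capitalized = "lower-case"
    return {
        "INCLUDES-NUMBER": any(ch.isdigit() for ch in word),
        "CAPITALIZED": capitalized,
        "INCLUDES-PERIOD": "." in word,
        "INCLUDES-COMMA": "," in word,
        "FINAL-PERIOD": word.endswith("."),
        "INCLUDES-HYPHEN": "-" in word,
        "ALL-UPPER-CASE": word.isupper(),
        "SHORT": len(word) <= 3,
    }

abbreviation = word_features("U.S.", sentence_initial=False)
compound = word_features("well-known", sentence_initial=False)
amount = word_features("1,000", sentence_initial=True)
```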
Next, exploratory data analysis was performed in order to determine
relevant features and their values, and to approximate which features
interact. Each word of the training data was then turned into a
feature vector, and the feature vectors were cross-classified in a
contingency table. The contingency table was smoothed using a
loglinear model.
3.2 Data 
Training and evaluation data was obtained from the Penn Treebank
Brown corpus (Marcus, Santorini, and Marcinkiewicz, 1993). The
characteristics of "rare" words that might show up as unknown words
differ from the characteristics of words in general, so a two-step
procedure was employed a first time to obtain a set of "rare" words
as training data, and again a second time to obtain a separate set of
"rare" words as evaluation data. There were 17,000 words in the
training data, and 21,000 words in the evaluation data. Ambiguity
resolution accuracy was evaluated for the "overall accuracy" (the
percentage of cases in which the most likely POS tag is correct), and
"cutoff factor accuracy" (accuracy of the answer set consisting of
all POS tags whose probability lies within a factor F of the most
likely POS (de Marcken, 1990)).

Figure 1: Performance of Different Models
(panels: Overall Accuracy; F=0.4 Set Accuracy. Models: 4 Independent
Features, 4 Loglinear Features, 9 Loglinear Features)
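
The two accuracy measures described in this section can be sketched as follows. The sketch assumes that "within a factor F" means a probability of at least F times that of the most likely tag; this reading, the function names, and the toy distributions are assumptions.

```python
def answer_set(dist, factor):
    """Tags whose probability is at least `factor` times the best one."""
    best = max(dist.values())
    return {tag for tag, p in dist.items() if p >= factor * best}

def accuracies(predictions, gold, factor=0.4):
    """Return (overall accuracy, cutoff factor accuracy)."""
    pairs = list(zip(predictions, gold))
    # Overall accuracy: the single most likely tag must be correct.
    overall = sum(max(d, key=d.get) == g for d, g in pairs)
    # Cutoff factor accuracy: the correct tag must be in the answer set.
    in_set = sum(g in answer_set(d, factor) for d, g in pairs)
    return overall / len(pairs), in_set / len(pairs)

# Hypothetical predicted distributions for two unknown words.
predictions = [
    {"NN": 0.5, "VB": 0.3, "JJ": 0.1, "RB": 0.1},
    {"NN": 0.7, "VB": 0.2, "JJ": 0.1},
]
gold = ["VB", "VB"]
overall, cutoff = accuracies(predictions, gold, factor=0.4)
```

In this toy case neither most likely tag is correct, but the correct tag falls inside the answer set for the first word, so the cutoff factor accuracy is higher than the overall accuracy.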
3.3 Accuracy Results 
Weischedel et al. (1993) describe a model for unknown words that uses
four features, but treats the features as independent. We
reimplemented this model by using four features: POS, INFLECTION,
CAPITALIZED, and HYPHENATED. In Figures 1 and 2, the results for this
model are labeled 4 Independent Features. For comparison, we created
a loglinear model with the same four features: the results for this
model are labeled 4 Loglinear Features.
The highest accuracy was obtained by the loglinear model that
includes all two-way interactions and consists of two contingency
tables with the following features: POS, ALL-UPPER-CASE, HYPHENATED,
INCLUDES-NUMBER, CAPITALIZED, INFLECTION, SHORT, PREFIX, and SUFFIX.
The results for this model are labeled 9 Loglinear Features. The
parameters for all three unknown word models were estimated from the
training data, and the models were evaluated on the evaluation data.
The accuracy of the different models in assigning the most likely
POSs to words is summarized in Figure 1. The two bar charts show two
different accuracy measures: percent correct (Overall Accuracy), and
percent correct within the F=0.4 cutoff factor answer set (F=0.4 Set
Accuracy). In both cases, the loglinear model with four features
obtains higher accuracy than the method that assumes independence
between the same four features. The loglinear model with nine
features further improves this score.

Figure 2: Effect of Number of Features on Accuracy
(x-axis: Number of Features)

Figure 3: Error Rate on Unknown Words
3.4 Effect of Number of Features on 
Accuracy 
The performance of the loglinear model can be improved by adding more
features, but this is not possible with the simpler model that
assumes independence between the features. Figure 2 shows the
performance of the two types of models with feature sets that ranged
from a single feature to nine features.
As the diagram shows, the accuracies for both 
methods rise with the first few features, but then 
the two methods show a clear divergence. The accuracy
of the simpler method levels off at
around 50-55%, while the loglinear model reaches
an accuracy of 70-75%. This shows that the loglin- 
ear model is able to tolerate redundant features and 
use information from more features than the simpler 
method, and therefore achieves better results at am- 
biguity resolution. 
3.5 Adding Context to the Model 
Next, we added a stochastic POS tagger (Charniak et al., 1993) to
provide a model of context. A stochastic POS tagger assigns POS
labels to words in a sentence by using two parameters:
• Lexical Probabilities: $P(w \mid t)$ -- the probability of
observing word $w$ given that the tag $t$ occurred.
• Contextual Probabilities: $P(t_i \mid t_{i-1}, t_{i-2})$ -- the
probability of observing tag $t_i$ given that the two previous tags
$t_{i-1}, t_{i-2}$ occurred.
The tagger maximizes the probability of the tag sequence
$T = t_1, t_2, \ldots, t_n$ given the word sequence
$W = w_1, w_2, \ldots, w_n$, which is approximated as follows:

$$P(T \mid W) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2}) \tag{4}$$
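
The product in equation (4) can be evaluated for one candidate tag sequence as sketched below. The sketch works in log space to avoid underflow, pads the history with a start symbol for the first two positions (an assumption; the treatment of sentence boundaries is not specified here), and uses hypothetical probability tables. A full tagger would search over all tag sequences, which this fragment does not do.

```python
import math

def sequence_log_score(words, tags, lexical, contextual, start="<s>"):
    """Log of the right-hand side of equation (4) for one tag sequence.

    lexical[(w, t)] holds P(w | t); contextual[(t, t_prev, t_prev2)]
    holds P(t | t_prev, t_prev2).
    """
    history = (start, start)  # (t_{i-1}, t_{i-2}), padded at the start
    total = 0.0
    for w, t in zip(words, tags):
        total += math.log(lexical[(w, t)])
        total += math.log(contextual[(t,) + history])
        history = (t, history[0])
    return total

# Hypothetical parameters for a two-word sentence.
lexical = {("the", "DT"): 0.5, ("dog", "NN"): 0.1}
contextual = {("DT", "<s>", "<s>"): 0.4, ("NN", "DT", "<s>"): 0.5}
score = sequence_log_score(["the", "dog"], ["DT", "NN"], lexical, contextual)
```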
The accuracy of the combination of the loglinear 
model for local features and the stochastic POS tag- 
ger for contextual features was evaluated empirically 
by comparing three methods of handling unknown 
words: 
• Unigram: Using the prior probability distri- 
bution P(t) of the POS tags for rare words. 
• Probabilistic UWM: Using the probabilistic
model that assumes independence between the 
features. 
• Classifier UWM: Using the loglinear model 
for unknown words. 
Separate sets of training and evaluation data for the tagger were
obtained from the Penn Treebank Wall Street Journal corpus.
Evaluation of the combined system was performed on different
configurations of the POS tagger on 30-40 different samples
containing 4,000 words each.
Since the tagger displays considerable variance in its accuracy in
assigning POS to unknown words in context, we use boxplots to display
the results. Figure 3 compares the tagging error rate on unknown
words for the unigram method (left) and the loglinear method with
nine features (right, labeled "statistical classifier"). This shows
that the loglinear model significantly improves the Part-of-Speech
tagging accuracy of a stochastic tagger on unknown words. The median
error rate is lowered considerably, and samples with error rates over
32% are eliminated entirely.
Figure 4: Effect of Proportion of Unknown Words on Overall Tagging
Error Rate
(x-axis: Percentage of Unknown Words; series: Probabilistic UWM,
Loglinear UWM)
3.6 Effect of Proportion of Unknown 
Words 
Since most of the lexical ambiguity resolution power 
of stochastic POS tagging comes from the lexical
probabilities, unknown words represent a significant 
source of error. Therefore, we investigated the effect 
of different types of models for unknown words on 
the error rate for tagging text with different propor- 
tions of unknown words. 
Samples of text that contained different propor- 
tions of unknown words were tagged using the three 
different methods for handling unknown words de- 
scribed above. The overall tagging error rate in- 
creases significantly as the proportion of new words 
increases. Figure 4 shows a graph of overall tagging
error rate versus percentage of unknown words in the
text. The graph compares the three different meth- 
ods of handling unknown words. The diagram shows 
that the loglinear model leads to better overall tag- 
ging performance than the simpler methods, with a 
clear separation of all samples whose proportion of 
new words is above approximately 10%. 
4 Predicting PP Attachment 
In the second series of experiments, we compare the 
performance of different statistical models on the 
task of predicting Prepositional Phrase (PP) attach- 
ment. 
4.1 Features for PP Attachment 
First, an initial set of linguistic features that could 
be useful for predicting PP attachment was deter- 
mined. The initial set included the following fea- 
tures: 
• PREPOSITION. Possible values of this feature in- 
clude one of the more frequent prepositions in 
the training set, or the value other-prep. 
• VERB-LEVEL. Lexical association strength between
the verb and the preposition.
• NOUN-LEVEL. Lexical association strength be- 
tween the noun and the preposition. 
• NOUN-TAG. Part-of-Speech of the nominal at- 
tachment site. This is included to account for 
correlations between attachment and syntactic 
category of the nominal attachment site, such 
as "PPs disfavor attachment to proper nouns." 
• NOUN-DEFINITENESS. Does the nominal attachment site include a
definite determiner? This feature is included to account for a
possible correlation between PP attachment to the nominal site and
definiteness, which was derived by (Hirst, 1986) from the principle
of presupposition minimization of (Crain and Steedman, 1985).
• PP-OBJECT-TAG. Part-of-speech of the object of the PP. Certain
types of PP objects favor attachment to the verbal or nominal site.
For example, temporal PPs, such as "in 1959", where the prepositional
object is tagged CD (cardinal), favor attachment to the VP, because
the VP is more likely to have a temporal dimension.
The association strengths for VERB-LEVEL and NOUN-LEVEL were measured
using the Mutual Information between the noun or verb, and the
preposition.[1] The probabilities were derived as Maximum Likelihood
estimates from all PP cases in the training data. The Mutual
Information values were ordered by rank. Then, the association
strengths were categorized into eight levels (A-H), depending on
percentile in the ranked Mutual Information values.
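
A minimal sketch of this categorization follows. It assumes the pointwise form of Mutual Information (the logarithm of the ratio between the joint probability and the product of the marginals), eight equally wide percentile bands, and level A for the lowest band; the band boundaries and the direction of the lettering are not specified here, so these are assumptions, as are the function names.

```python
import math

def mutual_information(p_joint, p_word, p_prep):
    """log2 of the joint probability over the product of the marginals."""
    return math.log2(p_joint / (p_word * p_prep))

def level(value, all_values, levels="ABCDEFGH"):
    """Map a value to a level by its percentile among all observed values."""
    # Rank = number of observed values strictly below this one.
    rank = sum(1 for v in all_values if v < value)
    index = min(rank * len(levels) // len(all_values), len(levels) - 1)
    return levels[index]

# Hypothetical probabilities: the pair occurs twice as often as chance.
mi = mutual_information(p_joint=0.1, p_word=0.2, p_prep=0.25)

# Hypothetical Mutual Information values for sixteen word pairs.
values = [float(v) for v in range(16)]
lowest, middle, highest = level(0.0, values), level(8.0, values), level(15.0, values)
```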
4.2 Experimental Data and Evaluation 
Training and evaluation data was prepared from the Penn Treebank. All
1.1 million words of parsed text in the Brown Corpus, and 2.6 million
words of parsed WSJ articles, were used. All instances of PPs that
are attached to VPs and NPs were extracted. This resulted in 82,000
PP cases from the Brown Corpus, and 89,000 PP cases from the WSJ
articles. Verbs and nouns were lemmatized to their root forms if the
root forms were attested in the corpus. If the root form did not
occur in the corpus, then the inflected form was used.
All the PP cases from the Brown Corpus, and 50,000 of the WSJ cases,
were reserved as training data. The remaining 39,000 WSJ PP cases
formed the evaluation pool. In each experiment, performance was
evaluated on a series of 25 random samples of 100 PP cases from the
evaluation pool, in order to provide a characterization of the error
variance.

[1] Mutual Information provides an estimate of the magnitude of the
ratio between the joint probability P(verb/noun, preposition), and
the joint probability assuming independence
P(verb/noun)P(preposition); see (Church and Hanks, 1990).

Figure 5: Results for Two Attachment Sites
(methods shown: Right Association, Hindle & Rooth, Loglinear Model)

Figure 6: Three Attachment Sites: Right Association and Lexical
Association
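
The 25-sample evaluation scheme can be sketched as follows. The sketch assumes that each sample is drawn without replacement from the pool and that samples are drawn independently of one another; the function name, the fixed random seed, and the toy pool and classifier are hypothetical.

```python
import random
import statistics

def sampled_accuracies(cases, predict, n_samples=25, sample_size=100, seed=0):
    """Accuracy of `predict` on repeated random samples from the pool."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        sample = rng.sample(cases, sample_size)
        correct = sum(predict(features) == gold for features, gold in sample)
        results.append(correct / sample_size)
    return results

# Hypothetical pool in which three quarters of the cases attach to the noun.
pool = [((i,), "N" if i % 4 else "V") for i in range(1000)]

def always_noun(features):
    """A baseline that always predicts noun attachment."""
    return "N"

accs = sampled_accuracies(pool, always_noun)
median_accuracy = statistics.median(accs)
spread = max(accs) - min(accs)
```

Reporting the median together with the spread of the per-sample accuracies is what makes a boxplot comparison of methods possible.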
4.3 Experimental Results: Two
Attachment Sites
Previous work on automatic PP attachment disambiguation has only
considered the pattern of a verb phrase containing an object, and a
final PP. This leads to two possible attachment sites, the verb and
the object of the verb. The pattern is usually further simplified by
considering only the heads of the possible attachment sites,
corresponding to the sequence "Verb Noun1 Preposition Noun2".
The first set of experiments concerns this pattern. There are 53,000
such cases in the training data, and 16,000 such cases in the
evaluation pool. A number of methods were evaluated on this pattern
according to the 25-sample scheme described above. The results are
shown in Figure 5.
4.3.1 Baseline: Right Association 
Prepositional phrases exhibit a tendency to attach to the most recent
possible attachment site; this is referred to as the principle of
"Right Association". For the "V NP PP" pattern, this means preferring
attachment to the noun phrase. On the evaluation samples, a median of
65% of the PP cases were attached to the noun.
4.3.2 Results of Lexical Association 
(Hindle and Rooth, 1993) described a method for obtaining estimates
of lexical association strengths between nouns or verbs and
prepositions, and then using lexical association strength to predict
PP attachment. In our reimplementation of this method, the
probabilities were estimated from all the PP cases in the training
set. Since our training data are bracketed, it was possible to
estimate the lexical associations with much less noise than Hindle &
Rooth, who were working with unparsed text. The median accuracy for
our reimplementation of Hindle & Rooth's method was 81%. This is
labeled "Hindle & Rooth" in Figure 5.
4.3.3 Results of the Loglinear Model 
The loglinear model for this task used the features PREPOSITION,
VERB-LEVEL, NOUN-LEVEL, and NOUN-DEFINITENESS, and it included all
second-order interaction terms. This model achieved a median accuracy
of 82%.
Hindle & Rooth's lexical association strategy only uses one feature
(lexical association) to predict PP attachment, but, as the boxplot
shows, the results from the loglinear model for the "V NP PP" pattern
do not show any significant improvement.
4.4 Experimental Results: Three 
Attachment Sites 
As suggested by (Gibson and Pearlmutter, 1994), PP attachment for the
"Verb NP PP" pattern is relatively easy to predict because the two
possible attachment sites differ in syntactic category, and therefore
have very different kinds of lexical preferences. For example, most
PPs with of attach to nouns, and most PPs with to and by attach to
verbs. In actual texts, there are often more than two possible
attachment sites for a PP. Thus, a second, more realistic series of
experiments was performed that investigated different PP attachment
strategies for the pattern "Verb Noun1 Noun2 Preposition Noun3" that
includes more than two possible attachment sites that are not
syntactically heterogeneous. There were 28,000 such cases in the
training data, and 8,000 cases in the evaluation pool.
Figure 7: Summary of Results for Three Attachment Sites
(methods shown: Right Association, Split Hindle & Rooth, Loglinear
Model)
4.4.1 Baseline: Right Association 
As in the first set of experiments, a number of methods were
evaluated on the three attachment site pattern with 25 samples of 100
random PP cases. The results are shown in Figures 6-7. The baseline
is again provided by attachment according to the principle of "Right
Association" to the most recent possible site, i.e. attachment to
Noun2. A median of 69% of the PP cases were attached to Noun2.
4.4.2 Results of Lexical Association 
Next, the lexical association method was evaluated on this pattern.
First, the method described by Hindle & Rooth was reimplemented by
using the lexical association strengths estimated from all PP cases.
The results for this strategy are labeled "Basic Lexical Association"
in Figure 6. This method only achieved a median accuracy of 59%,
which is worse than always choosing the rightmost attachment site.
These results suggest that Hindle & Rooth's scoring function worked
well in the "Verb Noun1 Preposition Noun2" case not only because it
was an accurate estimator of lexical associations between individual
verbs/nouns and prepositions which determine PP attachment, but also
because it accurately predicted the general verb-noun skew of
prepositions.
4.4.3 Results of Enhanced Lexical 
Association 
It seems natural that this pattern calls for a com- 
bination of a structural feature with lexical associa- 
tion strength. To implement this, we modified Hin- 
dle & Rooth's method to estimate attachments to 
the verb, first noun, and second noun separately.
This resulted in estimates that combine the struc- 
tural feature directly with the lexical association 
strength. The modified method performed better 
than the original lexical association scoring function, 
but it still only obtained a median accuracy of 72%. 
This is labeled "Split Hindle & Rooth" in Figure 7. 
4.4.4 Results of Loglinear Model 
To create a model that combines various structural and lexical
features without independence assumptions, we implemented a loglinear
model that includes the variables VERB-LEVEL, FIRST-NOUN-LEVEL, and
SECOND-NOUN-LEVEL.[2] The loglinear model also includes the variables
PREPOSITION and PP-OBJECT-TAG. The contingency table was smoothed
with a loglinear model that includes all second-order interactions.
This method obtained a median accuracy of 79%; this is labeled
"Loglinear Model" in Figure 7. As the boxplot shows, it performs
significantly better than the methods that only use estimates of
lexical association. Compared with the "Split Hindle & Rooth" method,
the samples are a little less spread out, and there is no overlap at
all between the central 50% of the samples from the two methods.
4.5 Discussion 
The simpler "V NP PP" pattern with two syntacti- 
cally different attachment sites yielded a null result: 
The loglinear method did not perform significantly 
better than the lexical association method. This 
could mean that the results of the lexical associa- 
tion method can not be improved by adding other 
features, but it is also possible that the features that 
could result in improved accuracy were not identi- 
fied. 
The lexical association strategy does not perform 
well on the more difficult pattern with three possible 
attachment sites. The loglinear model, on the other 
hand, predicts attachment with significantly higher 
accuracy, achieving a clear separation of the central 
50% of the evaluation samples. 
5 Conclusions 
We have contrasted two types of statistical language models: A model
that derives a probability distribution over the response variable
that is properly conditioned on the combination of the explanatory
variables, and a simpler model that treats the explanatory variables
as independent, and therefore models the response variable simply as
the addition of the individual main effects of the explanatory
variables.
[2] These features use the same Mutual Information-based measure of
lexical association as the previous loglinear model for two possible
attachment sites, which were estimated from all nominal and verbal PP
attachments in the corpus. The features FIRST-NOUN-LEVEL and
SECOND-NOUN-LEVEL use the same estimates: in other words, in contrast
to the "Split Lexical Association" method, they were not estimated
separately for the two different nominal attachment sites.
The experimental results show that, with the same feature set,
modeling feature interactions yields better performance: such a model
achieves higher accuracy, and its accuracy can be raised with
additional features. It is interesting to note that modeling variable
interactions yields a higher performance gain than including
additional explanatory variables.
While these results do not prove that modeling 
feature interactions is necessary, we believe that they 
provide a strong indication. This suggests a number
of avenues for further research.
First, we could attempt to improve the specific models that were
presented by incorporating additional features, and perhaps by taking
into account higher-order features. This might help to address the
performance gap between our models and human subjects that has been
documented in the literature.[3] A more ambitious idea would be to
use a statistical model to rank overall parse quality for entire
sentences. This would be an improvement over schemes that assume
independence between a number of individual scoring functions, such
as (Alshawi and Carter, 1994). If such a model were to include only a
few general variables to account for such features as lexical
association and recency preference for syntactic attachment, it might
even be worthwhile to investigate it as an approximation to the human
parsing mechanism.

References 
Agresti, Alan. 1990. Categorical Data Analysis. John Wiley & Sons, New York. 
Alshawi, Hiyan and David Carter. 1994. Training 
and scaling preference functions for disambigua- 
tion. Computational Linguistics, 20(4):635-648. 
Bishop, Y. M., S. E. Fienberg, and P. W. Holland.
1975. Discrete Multivariate Analysis: Theory and
Practice. MIT Press, Cambridge, MA.
Charniak, Eugene, Curtis Hendrickson, Neil Jacobson,
and Mike Perkowitz. 1993. Equations for
part-of-speech tagging. In AAAI-93, pages 784-789.
Church, Kenneth W. and Patrick Hanks. 1990.
Word association norms, mutual information,
and lexicography. Computational Linguistics,
16(1):22-29.
Crain, Stephen and Mark J. Steedman. 1985. On
not being led up the garden path: The use of
context by the psychological syntax processor.
In David R. Dowty, Lauri Karttunen, and Arnold
M. Zwicky, editors, Natural Language Parsing,
pages 320-358, Cambridge, UK. Cambridge
University Press.

[3] For example, if random sentences with "Verb NP PP" cases from the
Penn Treebank are taken as the gold standard, then (Hindle and Rooth,
1993) and (Ratnaparkhi, Reynar, and Roukos, 1994) report that human
experts using only head words obtain 85%-88% accuracy. If the human
experts are allowed to consult the whole sentence, their accuracy
judged against random Treebank sentences rises to approximately 93%.

de Marcken, Carl G. 1990. Parsing the LOB corpus.
In Proceedings of ACL-90, pages 243-251.
Deming, W. E. and F. F. Stephan. 1940. On a least
squares adjustment of a sampled frequency table
when the expected marginal totals are known.
Ann. Math. Statist., 11:427-444.
Duda, Richard O. and Peter E. Hart. 1973. Pattern
Classification and Scene Analysis. John Wiley &
Sons, New York.
Fienberg, Stephen E. 1980. The Analysis of
Cross-Classified Categorical Data. The MIT Press,
Cambridge, MA, second edition.
Franz, Alexander. 1996. Automatic Ambiguity
Resolution in Natural Language Processing, volume
1171 of Lecture Notes in Artificial Intelligence.
Springer Verlag, Berlin.
Gibson, Ted and Neal Pearlmutter. 1994. A
corpus-based analysis of psycholinguistic constraints
on PP attachment. In Charles Clifton Jr., Lyn
Frazier, and Keith Rayner, editors, Perspectives
on Sentence Processing. Lawrence Erlbaum
Associates.
Hindle, Donald and Mats Rooth. 1993. Structural
ambiguity and lexical relations. Computational
Linguistics, 19(1):103-120.
Hirst, Graeme. 1986. Semantic Interpretation and 
the Resolution of Ambiguity. Cambridge Univer- 
sity Press, Cambridge. 
Marcus, Mitchell P., Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building a large 
annotated corpus of English: The Penn Treebank. 
Computational Linguistics, 19(2):313-330. 
Ratnaparkhi, Adwait, Jeff Reynar, and Salim
Roukos. 1994. A maximum entropy model
for Prepositional Phrase attachment. In ARPA
Workshop on Human Language Technology,
Plainsboro, NJ, March 8-11.
Weischedel, Ralph, Marie Meteer, Richard Schwartz, 
Lance Ramshaw, and Jeff Palmucci. 1993. Cop- 
ing with ambiguity and unknown words through 
probabilistic models. Computational Linguistics, 
19(2):359-382. 
