Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 27–34,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Detecting Parser Errors Using Web-based Semantic Filters
Alexander Yates, Stefan Schoenmackers, Oren Etzioni
University of Washington
Computer Science and Engineering
Box 352350
Seattle, WA 98195-2350
{ayates, stef, etzioni}@cs.washington.edu
Abstract
NLP systems for tasks such as question
answering and information extraction typ-
ically rely on statistical parsers. But the ef-
ficacy of such parsers can be surprisingly
low, particularly for sentences drawn from
heterogeneous corpora such as the Web.
We have observed that incorrect parses of-
ten result in wildly implausible semantic
interpretations of sentences, which can be
detected automatically using semantic in-
formation obtained from the Web.
Based on this observation, we introduce
Web-based semantic filtering—a novel,
domain-independent method for automat-
ically detecting and discarding incorrect
parses. We measure the effectiveness of
our filtering system, called WOODWARD,
on two test collections. On a set of TREC
questions, it reduces the error rate by 67%. On
a set of more complex Penn Treebank
sentences, it reduces the error rate by
20%.
1 Introduction
Semantic processing of text in applications such
as question answering or information extraction
frequently relies on statistical parsers. Unfortu-
nately, the efficacy of state-of-the-art parsers can
be disappointingly low. For example, we found
that the Collins parser correctly parsed just 42%
of the list and factoid questions from TREC 2004
(that is, 42% of the parses had 100% precision and
100% recall on labeled constituents). Similarly,
this parser produced 45% correct parses on a sub-
set of 100 sentences from section 23 of the Penn
Treebank.
Although statistical parsers continue to improve
their efficacy over time, progress is slow, par-
ticularly for Web applications where training the
parsers on a “representative” corpus of hand-
tagged sentences is not an option. Because of the
heterogeneous nature of text on the Web, such a
corpus would be exceedingly difficult to generate.
In response, this paper investigates the possibil-
ity of detecting parser errors by using semantic in-
formation obtained from the Web. Our fundamen-
tal hypothesis is that incorrect parses often result
in wildly implausible semantic interpretations of
sentences, which can be detected automatically in
certain circumstances. Consider, for example, the
following sentence from the Wall Street Journal:
“That compares with per-share earnings from con-
tinuing operations of 69 cents.” The Collins parser
yields a parse that attaches “of 69 cents” to “op-
erations,” rather than “earnings.” By computing
the mutual information between “operations” and
“cents” on the Web, we can detect that this attach-
ment is unlikely to be correct.
Our WOODWARD system detects parser errors
as follows. First, it maps the tree produced by a
parser to a relational conjunction (RC), a logic-
based representation language that we describe in
Section 2.1. Second, WOODWARD employs four
distinct methods for analyzing whether a conjunct
in the RC is likely to be “reasonable” as described
in Section 2.
Our approach makes two assumptions. First,
we assume that the sentences being parsed are
semantically sensible: if a sentence is absurd to
begin with, a correct parse could be deemed
incorrect. Second, we
require a corpus whose content overlaps at least in
part with the content of the sentences to be parsed.
Otherwise, much of our semantic analysis is
impossible.
In applications such as Web-based question an-
swering, these assumptions are quite natural. The
questions are about topics that are covered exten-
sively on the Web, and we can assume that most
questions link verbs to nouns in reasonable com-
binations. Likewise, when using parsing for infor-
mation extraction, we would expect our assump-
tions to hold as well.
Our contributions are as follows:
1. We introduce Web-based semantic filtering—
a novel, domain-independent method for de-
tecting and discarding incorrect parses.
2. We describe four techniques for analyzing
relational conjuncts using semantic informa-
tion obtained from the Web, and assess their
efficacy both separately and in combination.
3. We find that WOODWARD can filter good
parses from bad on TREC 2004 questions for
a reduction of 67% in error rate. On a harder
set of sentences from the Penn Treebank, the
reduction in error rate is 20%.
The remainder of this paper is organized as follows.
We give an overview of related work in Section 1.1.
Section 2 describes semantic filtering, including
our RC representation and the four Web-based
filters that constitute the WOODWARD system.
Section 3 presents our experiments and results,
and Section 4 concludes and gives ideas for
future work.
1.1 Related Work
The problem of detecting parse errors is most sim-
ilar to the idea of parse reranking. Collins (2000)
describes statistical techniques for reranking alter-
native parses for a sentence. Implicitly, a rerank-
ing method detects parser errors, in that if the
reranking method picks a new parse over the orig-
inal one, it is classifying the original one as less
likely to be correct. Collins uses syntactic and lex-
ical features and trains on the Penn Treebank; in
contrast, WOODWARD uses semantic features derived
from the Web. See Section 3 for a comparison
of our results with Collins’.
Several systems produce a semantic interpreta-
tion of a sentence on top of a parser. For example,
Bos et al. (2004) build semantic representations
from the parse derivations of a CCG parser, and
the English Resource Grammar (ERG) (Toutanova
et al., 2005) provides a semantic representation us-
ing minimal recursion semantics. Toutanova et al.
also include semantic features in their parse se-
lection mechanism, although it is mostly syntax-
driven. The ERG is a hand-built grammar and thus
does not have the same coverage as the grammar
we use. We also use the semantic interpretations
in a novel way, checking them against semantic
information on the Web to decide if they are plau-
sible.
The NLP literature is replete with examples of systems
that produce semantic interpretations and
use semantics to improve understanding. Several
systems in the 1970s and 1980s used hand-built
augmented transition networks or semantic
networks to prune bad semantic interpretations.
More recently, researchers have tried incorporating
large lexical and semantic resources like WordNet,
FrameNet, and PropBank into the disambiguation
process. Allen (1995) provides an overview of
some of this work and contains many references.
Our work focuses on using statistical techniques
over large corpora, reducing the need for hand-
built resources and making the system more robust
to changes in domain.
Numerous systems, including Question-
Answering systems like MULDER (Kwok et
al., 2001), PiQASso (Attardi et al., 2001), and
Moldovan et al.’s QA system (2003), use parsing
technology as a key component in their analysis
of sentences. In part to overcome incorrect parses,
Moldovan et al.’s QA system requires a complex
set of relaxation techniques. These systems
would greatly benefit from knowing when parses
are correct or incorrect. Our system is the first
to suggest using the output of a QA system to
classify the input parse as good or bad.
Several researchers have used pointwise mu-
tual information (PMI) over the Web to help make
syntactic and semantic judgments in NLP tasks.
Volk (2001) uses PMI to resolve preposition at-
tachments in German. Lapata and Keller (2005)
use web counts to resolve preposition attachments,
compound noun interpretation, and noun count-
ability detection, among other things. And Mark-
ert et al. (2003) use PMI to resolve certain types of
anaphora. We use PMI as just one of several tech-
niques for acquiring information from the Web.
2 Semantic Filtering
This section describes semantic filtering as imple-
mented in the WOODWARD system. WOODWARD
consists of two components: a semantic interpreter
that takes a parse tree and converts it to a conjunc-
tion of first-order predicates, and a sequence of
four increasingly sophisticated methods that check
semantic plausibility of conjuncts on the Web. Be-
low, we describe each component in turn.
Figure 1: An incorrect Collins parse of a TREC question. The parser treats “producing” as the main verb in
the clause, rather than “are”.
1. What(NP1) ∧ are(VP1, NP1, NP2) ∧ states(NP2) ∧ producing(VP2, NP2, NP3) ∧ oil(NP3) ∧ in(PP1, NP2, U.S.)
2. What(NP1) ∧ states(NP2) ∧ producing(VP1, NP3, NP2, NP1) ∧ oil(NP3) ∧ in(PP1, NP2, U.S.)
Figure 2: Example relational conjunctions. The first RC is the correct one for the sentence “What are oil producing
states in the U.S.?” The second is the RC derived from the Collins parse in Figure 1. The two RCs differ in their
verb predicates: the first contains are(VP1, NP1, NP2) and producing(VP2, NP2, NP3), whereas the second contains
only producing(VP1, NP3, NP2, NP1).
2.1 Semantic Interpreter
The semantic interpreter aims to make explicit the
relations that a sentence introduces, and the argu-
ments to each of those relations. More specifically,
the interpreter identifies the main verb relations,
preposition relations, and semantic type relations
in a sentence; identifies the number of arguments
to each relation; and ensures that for every ar-
gument that two relations share in the sentence,
they share a variable in the logical representation.
Given a sentence and a Penn-Treebank-style parse
of that sentence, the interpreter outputs a conjunc-
tion of First-Order Logic predicates. We call this
representation a relational conjunction (RC). Each
relation in an RC consists of a relation name and
a tuple of variables and string constants represent-
ing the arguments of the relation. As an example,
Figure 1 contains a sentence taken from the TREC
2003 corpus, parsed by the Collins parser. Fig-
ure 2 shows the correct RC for this sentence and
the RC derived automatically from the incorrect
parse.
Due to space constraints, we omit details about
the algorithm for converting a parse into an RC,
but Moldovan et al. (2003) describe a method sim-
ilar to ours.
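As an illustration of the RC representation described in this section, the following is a minimal Python sketch of one way to hold an RC in memory. The class names `Relation` and `RelationalConjunction` and the helper `sharing` are hypothetical and are not taken from the WOODWARD implementation; the example encodes the second RC of Figure 2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    """One conjunct of an RC: a relation name plus its argument tuple.

    Arguments are variables (e.g. "NP1") or string constants (e.g. "U.S.").
    """
    name: str
    args: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.name}({', '.join(self.args)})"


@dataclass(frozen=True)
class RelationalConjunction:
    """A conjunction of relations (hypothetical container type)."""
    relations: Tuple[Relation, ...]

    def __str__(self) -> str:
        return " ∧ ".join(str(r) for r in self.relations)

    def sharing(self, variable: str) -> Tuple[Relation, ...]:
        """Return the relations that mention the given variable."""
        return tuple(r for r in self.relations if variable in r.args)


# RC 2 from Figure 2 (derived from the incorrect Collins parse).
rc_incorrect = RelationalConjunction((
    Relation("What", ("NP1",)),
    Relation("states", ("NP2",)),
    Relation("producing", ("VP1", "NP3", "NP2", "NP1")),
    Relation("oil", ("NP3",)),
    Relation("in", ("PP1", "NP2", "U.S.")),
))
```

Shared variables make argument sharing explicit: in this RC, `rc_incorrect.sharing("NP2")` returns the `states`, `producing`, and `in` relations.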
2.2 Semantic Filters
Given the RC representation of a parsed sentence
as supplied by the Semantic Interpreter, we test the
parse using four web-based methods. Fundamen-
tally, the methods all share the underlying princi-
ple that some form of co-occurrence of terms in
the vast Web corpus can help decide whether a
proposed relationship is semantically plausible.
Traditional statistical parsers also use co-
occurrence of lexical heads as features for making
parse decisions. We expand on this idea in two
ways: first, we use a corpus several orders of magnitude
larger than the tagged corpora traditionally
used to train statistical parsers, so that the fundamental
problem of data sparseness is ameliorated.
Second, we search for targeted patterns of words
to help judge specific properties, like the number
of complements to a verb. We now describe each
of our techniques in more detail.
2.3 A PMI-Based Filter
A number of authors have demonstrated important
ways in which search engines can be used to un-
cover semantic relationships, especially Turney’s
notion of pointwise mutual information (PMI)
based on search-engine hit counts (Turney, 2001).
WOODWARD’s PMI-Based Filter (PBF) uses PMI
scores as features in a learned filter for predicates.
Following Turney, we use the formula below for
the PMI between two terms t1 and t2:
\[ \mathrm{PMI}(t_1, t_2) = \log\left( \frac{P(t_1 \wedge t_2)}{P(t_1)\,P(t_2)} \right) \qquad (1) \]
We use PMI scores to judge the semantic plau-
sibility of an RC conjunct as follows. We con-
struct a number of different phrases, which we call
discriminator phrases, from the name of the rela-
tion and the head words of each argument. For
example, the prepositional attachment “operations
of 69 cents” would yield phrases like “operations
of” and “operations of * cents”. (The ‘*’ char-
acter is a wildcard in the Google interface; it can
match any single word.) We then collect hitcounts
for each discriminator phrase, as well as for the
relation name and each argument head word, and
compute a PMI score for each phrase, using the
phrase’s hitcount as the numerator in Equation 1.
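The PMI computation from hit counts can be sketched as follows. This is a minimal sketch under stated assumptions: probabilities are estimated as hit count divided by an index size `total_pages`, which is a hypothetical parameter (the text does not give one), and Equation 1 is generalized from two terms to a product over all terms in the phrase. The function name `pmi_from_hits` is likewise illustrative.

```python
import math
from typing import Sequence


def pmi_from_hits(phrase_hits: int, term_hits: Sequence[int], total_pages: int) -> float:
    """Estimate PMI (Equation 1) from search-engine hit counts.

    phrase_hits: hit count of the discriminator phrase (the numerator).
    term_hits:   hit counts of the relation name and argument head words.
    total_pages: assumed index size used to turn counts into probabilities
                 (hypothetical; not specified in the text).
    Returns -inf when the phrase never occurs.
    """
    if total_pages <= 0 or any(h <= 0 for h in term_hits):
        raise ValueError("need a positive index size and positive term counts")
    if phrase_hits <= 0:
        return float("-inf")
    log_joint = math.log(phrase_hits / total_pages)
    log_independent = sum(math.log(h / total_pages) for h in term_hits)
    return log_joint - log_independent
```

A phrase that occurs much more often than its terms would by chance receives a large positive score, while a phrase that rarely or never occurs receives a low one.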
Given a set of such PMI scores for a single rela-
tion, we apply a learned classifier to decide if the
PMI scores justify calling the relation implausible.
This classifier (as well as all of our other ones)
is trained on a set of sentences from TREC and
the Penn Treebank; our training and test sets are
described in more detail in section 3. We parsed
each sentence automatically using Daniel Bikel’s
implementation of the Collins parsing model,1
trained on sections 2–21 of the Penn Treebank,
and then applied our semantic interpreter algo-
rithm to come up with a set of relations. We la-
beled each relation by hand for correctness. Cor-
rect relations are positive examples for our clas-
sifier, incorrect relations are negative examples
(and likewise for all of our other classifiers). We
used the LIBSVM software package2 to learn a
Gaussian-kernel support vector machine model
from the PMI scores collected for these relations.
We can then use the classifier to predict if a rela-
tion is correct or not depending on the various PMI
scores we have collected.
Because we require different discriminator
phrases for preposition relations and verb rela-
tions, we actually learn two different models.
After extensive experimentation, optimizing for
training set accuracy using leave-one-out cross-
validation, we ended up using only two patterns
for verbs: “noun verb” (“verb noun” for non-
subjects) and “noun * verb” (“verb * noun” for
non-subjects). We use the PMI scores from the
argument whose PMI values add up to the lowest
value as the features for a verb relation, with the
intuition being that the relation is correct only if
every argument to it is valid.
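The selection of verb features described above can be sketched as follows. The function name and the dictionary layout (argument name mapped to its list of PMI scores, one per pattern) are assumptions made for illustration.

```python
from typing import Dict, List


def weakest_argument_features(scores_by_arg: Dict[str, List[float]]) -> List[float]:
    """Return the PMI scores of the argument whose scores sum to the lowest value.

    scores_by_arg maps each argument of a verb relation to its PMI scores,
    one score per discriminator pattern (hypothetical layout).
    """
    if not scores_by_arg:
        raise ValueError("a verb relation needs at least one argument")
    weakest = min(scores_by_arg, key=lambda arg: sum(scores_by_arg[arg]))
    return scores_by_arg[weakest]
```

Using only the weakest argument reflects the stated intuition: a single implausible argument is enough to make the whole relation suspect.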
For prepositions, we use a larger set of patterns.
Letting arg1 and arg2 denote the head words of
the two arguments to a preposition, and letting
prep denote the preposition itself, we used the patterns
“arg1 prep”, “arg1 prep * arg2”, “arg1
prep the arg2”, “arg1 * arg2”, and, for verb attachments,
“arg1 it prep arg2” and “arg1 them
prep arg2”. These last two patterns are helpful for
preposition attachments to strictly transitive verbs.
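The preposition patterns listed above can be generated mechanically. The following sketch assumes a hypothetical helper `preposition_patterns` that takes the two argument head words, the preposition, and a flag saying whether the attachment is to a verb.

```python
from typing import List


def preposition_patterns(arg1: str, prep: str, arg2: str, verb_attachment: bool) -> List[str]:
    """Build the discriminator phrases for one preposition relation."""
    phrases = [
        f"{arg1} {prep}",
        f"{arg1} {prep} * {arg2}",   # '*' matches any single word
        f"{arg1} {prep} the {arg2}",
        f"{arg1} * {arg2}",
    ]
    if verb_attachment:
        # Extra patterns used only when the preposition attaches to a verb.
        phrases.append(f"{arg1} it {prep} {arg2}")
        phrases.append(f"{arg1} them {prep} {arg2}")
    return phrases
```

For the attachment of “of ... cents” to “operations”, this yields phrases such as “operations of” and “operations of * cents”.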
2.4 The Verb Arity Sampling Test
In our training set from the Penn Treebank, 13%
of the time the Collins parser chooses too many or
too few arguments to a verb. In this case, checking
the PMI between the verb and each argument in-
dependently is insufficient, and there is not enough
data to find hitcounts for the verb and all of its ar-
guments at once. We therefore use a different type
of filter in order to detect these errors, which we
call the Verb Arity Sampling Test (VAST).
1 http://www.cis.upenn.edu/~dbikel/software.html
2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Instead of testing a verb to see if it can take a
particular argument, we test if it can take a certain
number of arguments. The verb predicate produc-
ing(VP1, NP3, NP2, NP1) in interpretation 2 of
Figure 2, for example, has too many arguments.
To check if this predicate can actually take three
noun phrase arguments, we can construct a com-
mon phrase containing the verb, with the property
that if the verb can take three NP arguments, the
phrase will often be followed by a NP in text, and
vice versa. An example of such a phrase is “which
it is producing.” Since “which” and “it” are so
common, this phrase will appear many times on
the Web. Furthermore, for verbs like “produc-
ing,” there will be very few sentences in which
this phrase is followed by a NP (mostly temporal
noun phrases like “next week”). But for verbs like
“give” or “name,” which can accept three noun
phrase arguments, there will be significantly more
sentences where the phrase is followed by a NP.
The VAST algorithm is built upon this obser-
vation. For a given verb phrase, VAST first counts
the number of noun phrase arguments. The Collins
parser also marks clause arguments as being es-
sential by annotating them differently. VAST
counts these as well, and considers the sum of the
noun and clause arguments as the number of es-
sential arguments. If the verb is passive and the
number of essential arguments is one, or if the verb
is active and the number of essential arguments
is two, VAST performs no check. We call these
strictly transitive verb relations. If the verb is pas-
sive and there are two essential arguments, or if the
verb is active and there are three, it performs the
ditransitive check below. If the verb is active and
there is one essential argument, it does the intran-
sitive check described below. We call these two
cases collectively nontransitive verb relations. In
both cases, the checks produce a single real-valued
score, and we use a linear kernel SVM to iden-
tify an appropriate threshold such that predicates
above the threshold have the correct arity.
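The case analysis that decides which arity check VAST runs can be sketched as follows. The names `ArityCheck` and `select_arity_check` are hypothetical, and the treatment of argument counts that the text does not mention (for example, an active verb with four essential arguments) is an assumption: the sketch simply performs no check in those cases.

```python
from enum import Enum


class ArityCheck(Enum):
    NONE = "no check"
    DITRANSITIVE = "ditransitive check"
    INTRANSITIVE = "intransitive check"


def select_arity_check(num_np_args: int, num_clause_args: int, passive: bool) -> ArityCheck:
    """Choose the VAST check from the number of essential arguments and voice."""
    essential = num_np_args + num_clause_args
    if (passive and essential == 1) or (not passive and essential == 2):
        return ArityCheck.NONE            # strictly transitive verb relation
    if (passive and essential == 2) or (not passive and essential == 3):
        return ArityCheck.DITRANSITIVE    # nontransitive: possibly too many arguments
    if not passive and essential == 1:
        return ArityCheck.INTRANSITIVE    # nontransitive: possibly too few arguments
    # Assumption: combinations not covered by the text are left unchecked.
    return ArityCheck.NONE
```

For the predicate producing(VP1, NP3, NP2, NP1) from Figure 2, an active verb with three noun phrase arguments, the sketch selects the ditransitive check.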
The ditransitive check begins by querying
Google for two hundred documents containing the
phrase “which it verb” or “which they verb”. It
downloads each document and identifies the sen-
tences containing the phrase. It then POS-tags and
NP-chunks the sentences using a maximum en-
tropy tagger and chunker. It filters out any sen-
tences for which the word “which” is preceded by
a preposition. Finally, if there are enough sen-
tences remaining (more than ten), it counts the
number of sentences in which the verb is directly
followed by a noun phrase chunk, which we call an
extraction. It then calculates the ditransitive score
for verb v as the ratio of the number of extractions
E to the number of filtered sentences F:
\[ \mathrm{ditransitiveScore}(v) = \frac{E}{F} \qquad (2) \]
The intransitive check performs a very similar
set of operations. It fetches up to two hundred
sentences matching the phrases “but it verb” or
“but they verb”, tags and chunks them, and ex-
tracts noun phrases that directly follow the verb.
It calculates the intransitive score for verb v using
the number of extractions E and sentences S as:
\[ \mathrm{intransitiveScore}(v) = 1 - \frac{E}{S} \qquad (3) \]
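The two scores can be computed from the extraction and sentence counts as in the sketch below. The function names are illustrative. The ditransitive score is the ratio of extractions to filtered sentences, and the intransitive score is one minus the ratio of extractions to sentences. Returning `None` when there are too few sentences is an assumption about how the "more than ten" condition is handled, and the text does not say whether the same minimum applies to the intransitive check, so the sketch only guards against division by zero there.

```python
from typing import Optional

MIN_FILTERED_SENTENCES = 10  # the ditransitive check requires more than ten sentences


def ditransitive_score(extractions: int, filtered_sentences: int) -> Optional[float]:
    """Ratio of extractions E to filtered sentences F, or None if F is too small."""
    if filtered_sentences <= MIN_FILTERED_SENTENCES:
        return None
    return extractions / filtered_sentences


def intransitive_score(extractions: int, sentences: int) -> Optional[float]:
    """One minus the ratio of extractions E to sentences S, or None if S is zero."""
    if sentences == 0:
        return None
    return 1.0 - extractions / sentences
```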
2.5 TextRunner Filter
TextRunner is a new kind of web search engine.
Its design is described in detail elsewhere (Ca-
farella et al., 2006), but we utilize its capabil-
ities in WOODWARD. TextRunner provides a
search interface to a set of over a billion triples
of the form (object string, predicate string, ob-
ject string) that have been extracted automatically
from approximately 90 million documents to date.
The search interface takes queries of the form
(string1,string2,string3), and returns all tu-
ples for which each of the three tuple strings con-
tains the corresponding query string as a substring.
TextRunner’s object strings are very similar to
the standard notion of a noun phrase chunk. The
notion of a predicate string, on the other hand, is
loose in TextRunner; a variety of POS sequences
will match the patterns for an extracted relation.
For example, a search for tuples with a predicate
containing the word ‘with’ will yield the tuple
(risks, associated with dealing with, waste wood),
among thousands of others.
TextRunner embodies a trade-off with the PMI
method for checking the validity of a relation. Its
structure provides a much more natural search for
the purpose of verifying a semantic relationship,
since it has already arranged Web text into pred-
icates and arguments. It is also much faster than
querying a search engine like Google, both be-
cause we have local access to it and because com-
mercial search engines tightly limit the number
of queries an application may issue per day. On
the other hand, the TextRunner index is at present
still about two orders of magnitude smaller than
Google’s search index, due to limited hardware.
The TextRunner semantic filter checks the va-
lidity of an RC conjunct in a natural way: it asks
TextRunner for the number of tuples that match
the argument heads and relation name of the con-
junct being checked. Since TextRunner predicates
only have two arguments, we break the conjunct
into trigrams and bigrams of head words, and av-
erage over the hitcounts for each. For predicate
P(A1,... ,An) with n ≥ 2, the score becomes
\[
\mathrm{TextRunnerScore} = \frac{1}{n-1} \sum_{i=2}^{n} \mathrm{hits}(A_1, P, A_i)
  + \frac{1}{n} \left( \mathrm{hits}(A_1, P, ) + \sum_{i=2}^{n} \mathrm{hits}(, P, A_i) \right)
\]
As with PBF, we learn a threshold for good predi-
cates using the LIBSVM package.
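The TextRunner score can be sketched in a few lines. The sketch averages trigram hit counts over the n − 1 pairings of the first argument with each other argument, and adds the bigram hit counts divided by n. The callback `hits` stands in for a TextRunner query and is hypothetical; passing an empty string for an unconstrained slot is an assumption, motivated by the substring-matching behavior of the search interface.

```python
from typing import Callable, Sequence

HitsFn = Callable[[str, str, str], int]  # (object, predicate, object) -> tuple count


def textrunner_score(predicate: str, arg_heads: Sequence[str], hits: HitsFn) -> float:
    """Score predicate P(A1, ..., An) from trigram and bigram hit counts."""
    n = len(arg_heads)
    if n < 2:
        raise ValueError("the score is defined only for n >= 2 arguments")
    first, rest = arg_heads[0], arg_heads[1:]
    # Trigrams: (A1, P, Ai) for i = 2..n, averaged over n - 1 terms.
    trigram_avg = sum(hits(first, predicate, a) for a in rest) / (n - 1)
    # Bigrams: (A1, P, _) and (_, P, Ai) for i = 2..n, averaged over n terms.
    bigram_total = hits(first, predicate, "") + sum(hits("", predicate, a) for a in rest)
    return trigram_avg + bigram_total / n
```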
2.6 Question Answering Filter
When parsing questions, an additional method of
detecting incorrect parses becomes available: use
a question answering (QA) system to find answers.
If a QA system using the parse can find an answer
to the question, then the question was probably
parsed correctly.
To test this theory, we implemented a
lightweight, simple, and fast QA system that di-
rectly mirrors the semantic interpretation. It re-
lies on TextRunner and KnowItNow (Cafarella et
al., 2005) to quickly find possible answers, given
the relational conjunction (RC) of the question.
KnowItNow is a state-of-the-art information extraction
system that uses a set of domain-independent
patterns to efficiently find hyponyms of
a class.
We formalize the process as follows: define a
question as a set of variables Xi corresponding to
noun phrases, a set of noun type predicates Ti(Xi),
and a set of relational predicates Pi(Xi1,...,Xik)
which relate one or more variables and constants.
The conjunction of type and relational predicates
is precisely the RC.
We define an answer as a set of values for each
variable that satisfies all types and predicates
\[
\mathrm{ans}(x_1, \ldots, x_n) = \bigwedge_{i} T_i(x_i) \wedge \bigwedge_{j} P_j(x_{j1}, \ldots, x_{jk})
\]
The algorithm is as follows:
1. Compute the RC of the question sentence.
2. ∀i find instances of the class Ti for possible
values for Xi, using KnowItNow.
3. ∀j find instances of the relation predicate
Pj(xj1,...,xjk). We use TextRunner to ef-
ficiently find objects that are related by the
predicate Pj.
4. Return all tuples that satisfy ans(x1,...,xn)
The QA semantic filter runs the Question An-
swering algorithm described above. If the number
of returned answers is above a threshold (1 in our
case), it indicates the question has been parsed cor-
rectly. Otherwise, it indicates an incorrect parse.
This differs from the TextRunner semantic filter in
that it tries to find subclasses and instances, rather
than just argument heads.
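The answer-finding step can be illustrated with a brute-force sketch. It assumes that candidate instances for each typed variable and known tuples for each relational predicate have already been retrieved (in the paper, from KnowItNow and TextRunner respectively); the function name `find_answers` and the in-memory data layout are hypothetical, and a real system would not enumerate all combinations this way.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Fact = Tuple[str, ...]


def find_answers(
    type_instances: Dict[str, Set[str]],
    relations: Sequence[Tuple[Fact, Set[Fact]]],
) -> List[Dict[str, str]]:
    """Return every variable binding that satisfies all type and relation predicates.

    type_instances: variable name -> candidate values of its type predicate.
    relations: (argument names, known tuples) per relational predicate; an
               argument name that is not a variable is treated as a constant.
    """
    variables = sorted(type_instances)
    answers: List[Dict[str, str]] = []
    for values in product(*(sorted(type_instances[v]) for v in variables)):
        binding = dict(zip(variables, values))
        satisfied = all(
            tuple(binding.get(arg, arg) for arg in args) in facts
            for args, facts in relations
        )
        if satisfied:
            answers.append(binding)
    return answers
```

The QA filter then compares the number of returned bindings against its threshold.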
2.7 The WOODWARD Filter
Each of the above semantic filters has its strengths
and weaknesses. On our training data, TextRunner
had the most success of any of the methods on
classifying verb relations that did not have arity er-
rors. Because of sparse data problems, however, it
was less successful than PMI on preposition rela-
tions. The QA system had the interesting property
that when it predicted an interpretation was cor-
rect, it was always right; however, when it made a
negative prediction, its results were mixed.
WOODWARD combines the four semantic filters
in a way that draws on each of their strengths.
First, it checks if the sentence is a question that
does not contain prepositions. If so, it runs the
QA module, and returns true if that module does.
After trying the QA module, WOODWARD
checks each predicate in turn. If the predicate
is a preposition relation, it uses PBF to classify
it. For nontransitive verb relations, it uses VAST.
For strictly transitive verb relations, it uses Text-
Runner. WOODWARD accepts the RC if every re-
lation is predicted to be correct; otherwise, it re-
jects it.
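The combination logic can be sketched as follows. The individual filters are passed in as callables, so the sketch shows only the routing; the names `Conjunct` and `woodward_accepts` and the string labels for relation kinds are hypothetical. Leaving relations of other kinds (such as noun type predicates) unchecked is an assumption, since the text names a filter only for preposition and verb relations.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class Conjunct(NamedTuple):
    kind: str  # "preposition", "nontransitive_verb", "transitive_verb", or other
    text: str  # printable form of the relation, e.g. "in(PP1, NP2, U.S.)"


Check = Callable[[Conjunct], bool]


def woodward_accepts(
    conjuncts: Iterable[Conjunct],
    is_question: bool,
    qa_accepts: Callable[[], bool],
    pbf: Check,
    vast: Check,
    textrunner: Check,
) -> bool:
    """Return True if the RC is accepted as the product of a correct parse."""
    conjuncts = list(conjuncts)
    has_preposition = any(c.kind == "preposition" for c in conjuncts)
    # Questions without prepositions go to the QA filter first.
    if is_question and not has_preposition and qa_accepts():
        return True
    # Otherwise every relation must pass the filter assigned to its kind.
    route: Dict[str, Check] = {
        "preposition": pbf,
        "nontransitive_verb": vast,
        "transitive_verb": textrunner,
    }
    for conjunct in conjuncts:
        check = route.get(conjunct.kind)
        # Assumption: kinds without a dedicated filter are not checked.
        if check is not None and not check(conjunct):
            return False
    return True
```

Trusting a positive QA result immediately matches the observation that the QA filter's positive predictions were always right on the training data.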
3 Experiments
In our experiments we tested the ability of WOOD-
WARD to detect bad parses. Our experiments pro-
ceeded as follows: we parsed a set of sentences,
ran the semantic interpreter on them, and labeled
each parse and each relation in the resulting RCs
for correctness. We then extracted all of the nec-
essary information from the Web and TextRunner.
We divided the sentences into a training and test
set, and trained the filters on the labeled RCs from
the training sentences. Finally, we ran each of the
filters and WOODWARD on the test set to predict
which parses were correct. We report the results
below, but first we describe our datasets and tools
in more detail.
3.1 Datasets and Tools
Because question-answering is a key application,
we began with data from the TREC question-
answering track. We split the data into a train-
ing set of 61 questions (all of the TREC 2002 and
TREC 2003 questions), and a test set of 55 ques-
tions (all list and factoid questions from TREC
2004). We preprocessed the questions to remove
parentheticals (this affected 3 training questions
and 1 test question). We removed 12 test questions
because the Collins parser did not parse them as
questions,3 and that error was too easy to detect.
25 training questions had the same error, but we
left them in to provide more training data.
We used the Penn Treebank as our second data
set. Training sentences were taken from section
22, and test sentences from section 23. Because
PBF is time-consuming, we took a subset of 100
sentences from each section to expedite our exper-
iments. We extracted from each section the first
100 sentences that did not contain conjunctions,
and for which all of the errors, if any, were con-
tained in preposition and verb relations.
For our parser, we used Bikel’s implementation
of the Collins parsing model, trained on sections
2-21 of the Penn Treebank. We only use the top-
ranked parse for each sentence. For the TREC
data only, we first POS-tagged each question using
Ratnaparkhi’s MXPOST tagger. We judged each
of the TREC parses manually for correctness, but
scored the Treebank parses automatically.
3.2 Results and Discussion
Our semantic interpreter was able to produce the
appropriate RC for every parsed sentence in our
data sets, except for a few minor cases. Two id-
iomatic expressions in the WSJ caused the seman-
tic interpreter to find noun phrases outside of a
clause to fill gaps that were not actually there. And
in several sentences with infinitive phrases, the se-
mantic interpreter did not find the extracted sub-
ject of the infinitive expression. It turned out that
none of these mistakes caused the filters to reject
correct parses, so we were satisfied that our results
mainly reflect the performance of the filters, rather
than the interpreter.
3That is, the root node was neither SBARQ nor SQ.
Relation Type     num. correct   num. incorrect   PBF acc.   VAST acc.   TextRunner acc.
Nontrans. Verb    41             35               0.54       0.66        0.52
Other Verb        126            68               0.72       N/A         0.73
Preposition       183            58               0.73       N/A         0.76
Table 1: Accuracy of the filters on three relation types in the TREC 2004 questions and WSJ data.

                               Baseline                            WOODWARD
       sents.   parser eff.    filter prec.   filter rec.   F1     filter prec.   filter rec.   F1     red. err.
trec   43       54%            0.54           1.0           0.70   0.82           1.0           0.90   67%
wsj    100      45%            0.45           1.0           0.62   0.58           0.88          0.70   20%
Table 2: Performance of WOODWARD on different data sets. Parser efficacy reports the percentage of sentences that
the Collins parser parsed correctly. See the text for a discussion of our baseline and the precision and recall metrics. We
weight precision and recall equally in calculating F1. Reduction in error rate (red. err.) reports the relative decrease in
error (error calculated as 1 − F1) over baseline.
In Table 1 we report the accuracy of our first
three filters on the task of predicting whether a re-
lation in an RC is correct. We break these results
down into three categories for the three types of
relations we built filters for: strictly transitive verb
relations, nontransitive verb relations, and prepo-
sition relations. Since the QA filter works at the
level of an entire RC, rather than a single relation,
it does not apply here. These results show that the
trends on the training data mostly held true: VAST
was quite effective at verb arity errors, and Text-
Runner narrowly beat PBF on the remaining verb
errors. However, on our training data PBF nar-
rowly beat TextRunner on preposition errors, and
the reverse was true on our test data.
Our QA filter predicts whether a full parse is
correct with an accuracy of 0.76 on the 17 TREC
2004 questions that had no prepositions. The
Collins parser achieves the same level of accuracy
on these sentences, so the main benefit of the QA
filter for WOODWARD is that it never misclassi-
fies an incorrect parse as a correct one, as was ob-
served on the training set. This property allows
WOODWARD to correctly predict a parse is correct
whenever it passes the QA filter.
Classification accuracy is important for good
performance, and we report it to show how effec-
tive each of WOODWARD’s components is. How-
ever, it fails to capture the whole story of a filter’s
performance. Consider a filter that simply predicts
that every sentence is incorrectly parsed: it would
have an overall accuracy of 55% on our WSJ cor-
pus, not too much worse than WOODWARD’s clas-
sification accuracy of 66% on this data. However,
such a filter would be useless because it filters out
every correctly parsed sentence.
Let the filtered set be the set of sentences that a
filter predicts to be correctly parsed. The perfor-
mance of a filter is better captured by two quanti-
ties related to the filtered set: first, how “pure” the
filtered set is, or how many good parses it contains
compared to bad parses; and second, how waste-
ful the filter is in terms of losing good parses from
the original set. We measure these two quantities
using metrics we call filter precision and filter re-
call. Filter precision is defined as the ratio of cor-
rectly parsed sentences in the filtered set to total
sentences in the filtered set. Filter recall is defined
as the ratio of correctly parsed sentences in the fil-
tered set to correctly parsed sentences in the un-
filtered set. Note that these metrics are quite dif-
ferent from the labeled constituent precision/recall
metrics that are typically used to measure statisti-
cal parser performance.
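The two metrics can be stated compactly in code. In this sketch, `is_correct[i]` says whether sentence i was parsed correctly and `is_kept[i]` says whether the filter placed it in the filtered set; the function name and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def filter_precision_recall(
    is_correct: Sequence[bool], is_kept: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (filter precision, filter recall) over a set of parsed sentences."""
    if len(is_correct) != len(is_kept):
        raise ValueError("both sequences must describe the same sentences")
    kept_correct = sum(1 for c, k in zip(is_correct, is_kept) if c and k)
    kept = sum(1 for k in is_kept if k)
    correct = sum(1 for c in is_correct if c)
    precision = kept_correct / kept if kept else 0.0
    recall = kept_correct / correct if correct else 0.0
    return precision, recall
```

A filter that keeps every sentence has recall 1.0 and a precision equal to the fraction of correctly parsed sentences, while a filter that keeps nothing has recall 0.0.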
Table 2 shows our overall results for filtering
parses using WOODWARD. We compare against
a baseline model that predicts every sentence is
parsed correctly. WOODWARD outperforms this
baseline in precision and F1 measure on both of
our data sets.
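The F1 and error-reduction figures in Table 2 follow from the precision and recall values, as the sketch below shows. The function names are illustrative. Because the table entries are rounded, values recomputed from them can differ slightly from the reported ones.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equal weighting)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def error_reduction(baseline_f1: float, filtered_f1: float) -> float:
    """Relative decrease in error, with error defined as 1 - F1."""
    baseline_error = 1.0 - baseline_f1
    if baseline_error == 0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - (1.0 - filtered_f1)) / baseline_error


# TREC row of Table 2: baseline (0.54, 1.0) versus WOODWARD (0.82, 1.0).
trec_reduction = error_reduction(f1(0.54, 1.0), f1(0.82, 1.0))
# WSJ row of Table 2: baseline (0.45, 1.0) versus WOODWARD (0.58, 0.88).
wsj_reduction = error_reduction(f1(0.45, 1.0), f1(0.58, 0.88))
```

With the rounded table entries, `trec_reduction` comes out at about 0.67 and `wsj_reduction` at about 0.2.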
Collins (2000) reports a decrease in error rate
of 13% over his original parsing model (the same
model as used in our experiments) by performing
a discriminative reranking of parses. Our WSJ
test set is a subset of the set of sentences used
in Collins’ experiments, so our results are not di-
rectly comparable, but we do achieve a roughly
similar decrease in error rate (20%) when we use
our filtered precision/recall metrics. We also mea-
sured the labeled constituent precision and recall
of both the original test set and the filtered set, and
found a decrease in error rate of 37% according to
this metric (corresponding to a jump in F1 from
90.1 to 93.8). Note that in our case, the error is re-
duced by throwing out bad parses, rather than try-
ing to fix them. The 17% difference between the
two decreases in error rate is probably due to the
fact that WOODWARD is more likely to detect the
worse parses in the original set, which contribute a
proportionally larger share of error in labeled con-
stituent precision/recall in the original test set.
WOODWARD performs significantly better on
the TREC questions than on the Penn Treebank
data. One major reason is that there are far more
clause adjuncts in the Treebank data, and adjunct
errors are intrinsically harder to detect. Con-
sider the Treebank sentence: “The S&P pit stayed
locked at its 30-point trading limit as the Dow av-
erage ground to its final 190.58 point loss Friday.”
The parser incorrectly attaches the clause begin-
ning “as the Dow . . . ” to “locked”, rather than
to “stayed.” Our current methods aim to use key
words in the clause to determine if the attachment
is correct. However, with such clauses there is no
single key word that can allow us to make that de-
termination. We anticipate that as the paradigm
matures we and others will design filters that can
use more of the information in the clause to help
make these decisions.
4 Conclusions and Future Work
Given a parse of a sentence, WOODWARD con-
structs a representation that identifies the key se-
mantic relationships implicit in the parse. It then
uses a set of Web-based sampling techniques to
check whether these relationships are plausible.
If any of the relationships is highly implausible,
WOODWARD concludes that the parse is incorrect.
WOODWARD successfully detects common errors
in the output of the Collins parser including verb
arity errors as well as preposition and verb attach-
ment errors. While more extensive experiments
are clearly necessary, our results suggest that the
paradigm of Web-based semantic filtering could
substantially improve the performance of statisti-
cal parsers.
In future work, we hope to further validate this
paradigm by constructing additional semantic fil-
ters that detect other types of errors. We also plan
to use semantic filters such as WOODWARD to
build a large-scale corpus of automatically-parsed
sentences that has higher accuracy than can be
achieved today. Such a corpus could be used to
re-train a statistical parser to improve its perfor-
mance. Beyond that, we plan to embed semantic
filtering into the parser itself. If semantic filters
become sufficiently accurate, they could rule out
enough erroneous parses that the parser is left with
just the correct one.
Acknowledgements
This research was supported in part by NSF grant
IIS-0312988, DARPA contract NBCHD030010,
ONR grant N00014-02-1-0324 as well as gifts
from Google, and carried out at the University of
Washington’s Turing Center.

References
J. Allen. 1995. Natural Language Understand-
ing. Benjamin/Cummings Publishing, Redwood
City, CA, 2nd edition.
G. Attardi, A. Cisternino, F. Formica, M. Simi, and
A. Tommasi. 2001. PiQASso: Pisa Question An-
swering System. In TREC.
J. Bos, S. Clark, M. Steedman, J. R. Curran, and
J. Hockenmaier. 2004. Wide-coverage semantic
representations from a CCG parser. In COLING.
M. J. Cafarella, D. Downey, S. Soderland,
and O. Etzioni. 2005. KnowItNow: Fast,
scalable information extraction from the web. In
HLT-EMNLP.
M. J. Cafarella, M. Banko, and O. Etzioni. 2006. Re-
lational web search. UW Tech Report 06-04-02.
M. Collins. 2000. Discriminative reranking for natural
language parsing. In ICML, pages 175–182.
C. C. T. Kwok, O. Etzioni, and D. S. Weld. 2001. Scal-
ing question answering to the web. In WWW.
M. Lapata and F. Keller. 2005. Web-based models for
natural language processing. ACM Transactions on
Speech and Language Processing, 2:1–31.
K. Markert, N. Modjeska, and M. Nissim. 2003. Us-
ing the web for nominal anaphora resolution. In
EACL Workshop on the Computational Treatment of
Anaphora.
D. Moldovan, C. Clark, S. Harabagiu, and S. Maiorano.
2003. Cogex: A logic prover for question answer-
ing. In HLT.
K. Toutanova, C. D. Manning, D. Flickinger, and
S. Oepen. 2005. Stochastic HPSG parse disam-
biguation using the Redwoods Corpus. Journal of
Logic and Computation.
P. D. Turney. 2001. Mining the Web for Synonyms:
PMI–IR versus LSA on TOEFL. Lecture Notes in
Computer Science, 2167:491–502.
M. Volk. 2001. Exploiting the WWW as a corpus to
resolve PP attachment ambiguities. In Corpus Lin-
guistics.
