TTP: A FAST AND ROBUST PARSER FOR NATURAL LANGUAGE 
TOMEK STRZALKOWSKI 
Courant Institute of Mathematical Sciences 
New York University 
715 Broadway, rm 704 
New York, NY 10003 
tomek@cs.nyu.edu 
ABSTRACT 
In this paper we describe TTP, a fast and robust
natural language parser which can analyze written text
and generate regularized parse structures for sentences
and phrases at the speed of approximately 0.5
sec/sentence, or 44 words per second. The parser is
based on a wide coverage grammar for English,
developed by New York University's Linguistic
String Project, and it uses the machine-readable version
of the Oxford Advanced Learner's Dictionary as a
source of its basic vocabulary. The parser operates on
stochastically tagged text, and contains a powerful
skip-and-fit recovery mechanism that allows it to deal
with extra-grammatical input and to operate effectively
under severe time pressure. Empirical experiments
testing the parser's speed and accuracy were performed
on several collections: a collection of technical
abstracts (CACM-3204), a corpus of news messages
(MUC-3), a selection from the ACM Computer Library
database, and a collection of Wall Street Journal articles,
approximately 50 million words in total.
1. INTRODUCTION 
Recently, there has been a growing demand for 
fast and reliable natural language processing tools, 
capable of performing reasonably accurate syntactic 
analysis of large volumes of text within an acceptable 
time. A full sentential parser that produces a complete
analysis of input may be considered reasonably fast if
the average parsing time per sentence falls anywhere
between 2 and 10 seconds. A large volume of text,
perhaps a gigabyte or more, would contain as many as
7 million sentences. At a speed of, say, 6
sec/sentence, this much text would require well over a
year to parse. While 7 million sentences is a lot of text,
this much may easily be contained in a fair-sized text
database. Therefore, the parsing speed would have to
be increased by at least a factor of 10 to make such a
task manageable.
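The estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch and not part of the original system; the function name parse_days is ours, and the 0.6 sec/sentence figure is simply the tenfold speed-up applied to 6 sec/sentence.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def parse_days(sentences: int, sec_per_sentence: float) -> float:
    """Wall-clock days needed to parse a corpus one sentence at a time."""
    return sentences * sec_per_sentence / SECONDS_PER_DAY


corpus = 7_000_000                  # sentences, the figure used in the text
slow = parse_days(corpus, 6.0)      # about 486 days, i.e. well over a year
fast = parse_days(corpus, 0.6)      # after a tenfold speed-up: about 49 days
print(f"{slow:.0f} days at 6 sec/sentence, {fast:.0f} days at 0.6 sec/sentence")
```

Under these assumptions the tenfold speed-up brings the job down from more than a year to roughly seven weeks of sequential processing.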
In this paper we describe a fast and robust 
natural language parser that can analyze written text 
and generate regularized parse structures at a speed of 
below 1 second per sentence. In the experiments conducted
on a variety of natural language texts, including
technical prose, news messages, and newspaper articles,
the average parsing time varied between 0.4
sec/sentence and 0.7 sec/sentence, or between 1600
and 2600 words per minute, as we tried to find an
acceptable compromise between the parser's speed and
precision. 1
It has long been assumed that in order to gain
speed, one may have to trade in some of the parser's
accuracy. For example, we may have to settle for partial
parsing that would recognize only selected grammatical
structures (e.g. noun phrases; Ruge et al.,
1991), or would avoid making difficult decisions (e.g.
pp-attachment; Hindle, 1983). Much of the overhead
and inefficiency comes from the fact that the lexical
and structural ambiguity of natural language input can
only be dealt with using limited context information
available to the parser. Partial parsing techniques have
been used with considerable success in processing
large volumes of text: for example, AT&T's Fidditch
(Hindle and Rooth, 1991) parsed 13 million words of 
Associated Press news messages, while MIT's parser 
(de Marcken, 1990) was used to process the 1 million 
word Lancaster/Oslo/Bergen (LOB) corpus. In both 
cases, the parsers were designed to do partial process- 
ing only, that is, they would never attempt a complete 
analysis of certain constructions, such as the attach- 
ment of pp-adjuncts, subordinate clauses, or coordina- 
tions. This kind of partial analysis may be sufficient in 
some applications because of a relatively high preci- 
sion of identifying correct syntactic dependencies. 2 
However, the ratio at which these dependencies are 
identified (that is, the recall level) isn't sufficiently 
high due to the inherently partial character of the pars- 
ing process. The low recall means that many of the 
important dependencies are lost in parsing, and
therefore partial parsing may not be suitable in applications
such as information extraction or document
retrieval.
1 These results were obtained on a 21 MIPS SparcStation ELC. The experiments were performed within an information retrieval system so that the final recall and precision statistics were used to measure effectiveness of the parser.
2 Hindle and Rooth (1991) and Church and Hanks (1990) used partial parses generated by Fidditch to study word co-occurrence patterns in syntactic contexts.
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992    198    PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
The alternative is to create a parser that would
attempt to produce a complete parse, and would resort
to partial or approximate analysis only under exceptional
conditions such as extra-grammatical input or
severe time pressure. Encountering a construction
that it couldn't handle, the parser would first try to produce
an approximate analysis of the difficult fragment,
and then resume normal processing for the rest of the
input. The outcome is a kind of "fitted" parse,
reflecting a compromise between the actual input and
grammar-encoded preferences (imposed, mainly, in
rule ordering). 3
2. SKIP-AND-FIT RECOVERY IN PARSING
A robust parser must deal efficiently with
difficult input, whether it is an extra-grammatical
string, or a string whose complete analysis could be
considered too costly. Frequently, these two situations
are not distinguishable, especially for long and complex
sentences found in free running text. The parser
must be able to analyze such strings quickly and produce
at least partial structures, imposing preferences
when necessary, and even removing or inserting small
input fragments, if the data-driven processing falters.
For example, in the following sentence, 
The method is illustrated by the automatic construction
of both recursive and iterative programs
operating on natural numbers, lists, and trees, in
order to construct a program satisfying certain
specifications a theorem induced by those
specifications is proved, and the desired program
is extracted from the proof.
the italicized part is likely to cause additional complications
in parsing this lengthy string, and the parser
may be better off ignoring the fragment altogether. To
do so successfully, the parser must close the constituent
which is being currently parsed, and possibly a
few of its parent constituents, removing corresponding
productions from further consideration, until an
appropriate production is reactivated. The parser then
jumps over the intervening material so as to restart
processing of the remainder of the sentence using the
newly reactivated production. In the example at hand,
suppose that the parser has just read the word
specifications and is looking at the following article a.
Rather than continuing at the present level, the parser
reduces the phrase a program satisfying certain
specifications to NP, and then traces further reductions:
SI -> to V NP; SA -> SI; S -> NP V NP SA, until
production S -> S and S is reached. 4 Subsequently, the
parser skips input to find and, then resumes normal
processing.
3 The idea of parse "fitting" was partly inspired by the IBM parser (Jensen et al., 1983), as well as by the standard error recovery techniques used in shift-reduce parsing.
As may be expected, this kind of action involves
a great deal of indeterminacy which, in case of natural
language strings, is compounded by the high degree of
lexical ambiguity. If the purpose of this skip-and-fit
technique is to get the parser smoothly through even
the most complex strings, the amount of additional
backtracking caused by the lexical level ambiguity is
certain to defeat it. Without lexical disambiguation of
input, the parser's performance will deteriorate, even
if the skipping is limited only to certain types of adverbial
adjuncts. The most common cases of lexical ambiguity
are those of a plural noun (nns) vs. a singular
verb (vbz), a singular noun (nn) vs. a plural or
infinitive verb (vbp, vb), and a past tense verb (vbd) vs.
a past participle (vbn), as illustrated in the following
example.
The notation used (vbn or vbd?) explicitly associates
(nns or vbz?) a data structure (vb or nn)
shared (vbn or vbd?) by concurrent processes (nns
or vbz?) with operations defined (vbn or vbd?) on
it.
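To get a feeling for how quickly such ambiguity multiplies, one can enumerate the tag combinations for the six words marked in the example. The sketch below is illustrative only: it treats each marked word as independently two-way ambiguous, which is a simplifying assumption, and all names in it are ours.

```python
from itertools import product

# Ambiguous words and their candidate tags, as marked in the example above.
AMBIGUOUS = [
    ("used", ("vbn", "vbd")),
    ("associates", ("nns", "vbz")),
    ("structure", ("vb", "nn")),
    ("shared", ("vbn", "vbd")),
    ("processes", ("nns", "vbz")),
    ("defined", ("vbn", "vbd")),
]


def tag_assignments(words):
    """Enumerate every combination of candidate tags for the given words."""
    return list(product(*(tags for _, tags in words)))


readings = tag_assignments(AMBIGUOUS)
print(len(readings))  # 64, i.e. 2**6 combinations before any parsing starts
```

Only one of these 64 combinations is the intended reading, so an untagged parser may have to back out of many of the others.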
3. PART OF SPEECH TAGGER 
One way of dealing with lexical ambiguity is to
use a tagger to preprocess the input, marking each
word with a tag that indicates its syntactic categorization:
a part of speech with selected morphological
features such as number, tense, mode, case and degree.
The following are tagged sentences from the CACM-3204
collection: 5
The(dt) paper(nn) presents(vbz) a(dt)
proposal(nn) for(in) structured(vbn)
representation(nn) of(in) multiprogramming(vbg)
in(in) a(dt) high(jj) level(nn) language(nn) .(per)
The(dt) notation(nn) used(vbn) explicitly(rb)
associates(vbz) a(dt) data(nns) structure(nn)
shared(vbn) by(in) concurrent(jj) processes(nns)
with(in) operations(nns) defined(vbn) on(in)
it(pp) .(per)
The tags are understood as follows: dt - determiner, nn
- singular noun, nns - plural noun, in - preposition, jj
- adjective, vbz - verb in present tense third person
singular, to - particle "to", vbg - present participle, vbn
- past participle, vbd - past tense verb, vb - infinitive
verb, cc - coordinate conjunction.
4 The decision to force a reduction rather than to back up could be triggered by various means. In case of TTP parser, it is always induced by the time-out signal.
5 Tagged using the 35-tag Penn Treebank Tagset created at the University of Pennsylvania.
Tagging of the input text substantially reduces
the search space of a top-down parser since it resolves
most of the lexical level ambiguities. In the examples
above, tagging of presents as "vbz" in the first sentence
cuts off a potentially long and costly "garden
path" with presents as a plural noun followed by a
headless relative clause starting with (that) a proposal
.... In the second sentence, tagging resolves ambiguity
of used (vbn vs. vbd), and associates (vbz vs. nns).
Perhaps more importantly, elimination of word-level
lexical ambiguity allows the parser to make projections
about the input which is yet to be parsed, using a simple
lookahead; in particular, phrase boundaries can be
determined with a degree of confidence (Church,
1988). This latter property is critical for implementing
the skip-and-fit recovery technique outlined in the previous
section.
Tagging of input also helps to reduce the
number of parse structures that can be assigned to a
sentence, decreases the demand for consulting the
dictionary, and simplifies dealing with unknown
words. Since every item in the sentence is assigned a
tag, so are the words for which we have no entry in the
lexicon. Many of these words will be tagged as "np"
(proper noun); however, the surrounding tags may
force other selections. In the following example,
chinese, which does not appear in the dictionary, is
tagged as "jj": 6
this(dt) paper(nn) dates(vbz) back(rb) the(dt)
genesis(nn) of(in) binary(jj) conception(nn)
circa(in) 5000(cd) years(nns) ago(rb) ,(com)
as(rb) derived(vbn) by(in) the(dt) chinese(jj)
ancients(nns) .(per)
We use a stochastic tagger to process the input
text prior to parsing. The tagger is based upon a bigram
model; it selects the most likely tag for a word given
co-occurrence probabilities computed from a small
training set. 7
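The internals of the tagger are not described in this paper. As a rough illustration of how a bigram model can pick the single best tag sequence, the following sketch uses the standard Viterbi search over toy probabilities. The probabilities, the floor value, and all names are invented for the example and are not those of the tagger actually used.

```python
import math


def viterbi(words, tags, trans, emit, start="<s>", floor=1e-6):
    """Most likely tag sequence under a bigram tag model.

    trans[(prev, tag)] stands for P(tag | prev) and emit[(tag, word)] for
    P(word | tag); entries missing from the toy tables get a small floor.
    """
    # best[t] = (log-probability, path) of the best sequence ending in tag t
    best = {start: (0.0, [])}
    for word in words:
        nxt = {}
        for tag in tags:
            e = math.log(emit.get((tag, word), floor))
            score, path = max(
                ((lp + math.log(trans.get((prev, tag), floor)) + e, p)
                 for prev, (lp, p) in best.items()),
                key=lambda cand: cand[0],
            )
            nxt[tag] = (score, path + [tag])
        best = nxt
    return max(best.values(), key=lambda cand: cand[0])[1]


TAGS = ["dt", "nn", "nns", "vbz"]
TRANS = {("<s>", "dt"): 0.8, ("dt", "nn"): 0.9, ("nn", "vbz"): 0.5,
         ("nn", "nns"): 0.1, ("vbz", "dt"): 0.6, ("nns", "dt"): 0.05}
EMIT = {("dt", "the"): 0.6, ("dt", "a"): 0.3, ("nn", "paper"): 0.01,
        ("vbz", "presents"): 0.01, ("nns", "presents"): 0.01,
        ("nn", "proposal"): 0.01}

best_tags = viterbi("the paper presents a proposal".split(), TAGS, TRANS, EMIT)
print(best_tags)  # ['dt', 'nn', 'vbz', 'dt', 'nn']
```

With these toy numbers the ambiguous word presents is resolved to vbz because a verb is more likely than a plural noun after a singular noun and before a determiner.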
4. PARSING WITH TTP PARSER
TTP (Tagged Text Parser) is a top-down English
parser specifically designed for fast, reliable processing
of large amounts of text.
6 We use the machine-readable version of the Oxford Advanced Learner's Dictionary (OALD).
7 The program, supplied to us by Bolt Beranek and Newman, operates in two alternative modes, either selecting a single most likely tag for each word (best-tag option, the one we use at present), or supplying a short ranked list of alternatives (Meteer et al., 1991).
TTP is based on the Linguistic String Grammar
developed by Sager (1981). Written in Quintus Prolog,
the parser currently encompasses more than 400
grammar productions. 8 TTP produces a regularized
representation of each parsed sentence that reflects the
sentence's logical structure. This representation may
differ considerably from a standard parse tree, in that
the constituents get moved around (e.g., de-passivization,
de-dativization), and the phrases are
organized recursively around their head elements. An
important novel feature of TTP parser is that it is
equipped with a time-out mechanism that allows for
fast closing of more difficult sub-constituents after a
preset amount of time has elapsed without producing a
parse. Although a complete analysis is attempted for
each sentence, the parser may occasionally ignore
fragments of input to resume "normal" processing after
skipping a few words. These fragments are later
analyzed separately and attached as incomplete constituents
to the main parse tree.
As the parsing proceeds, each sentence receives
a new slot of time during which its parse is to be
returned. The amount of time allotted to any particular
sentence can be regulated to obtain an acceptable
compromise between the parser's speed and precision. In
our experiments we found that a 0.5 sec/sentence time
slot was appropriate for the CACM abstracts, while
0.7 sec/sentence was more appropriate for generally
longer sentences in MUC-3 articles. 9 The actual length
of the time interval allotted to any one sentence may
depend on this sentence's length in words, although
this dependency need not be linear. Such adjustments
will have only limited impact on the parser's speed,
but they may affect the quality of produced parse trees.
Unfortunately, there is no obvious way to evaluate
quality of parsing except by using its results to attain
some measurable ends. We used the parsed CACM
collection to generate domain-specific word correlations
for query processing in an information retrieval
system, and the results were satisfactory. For other
applications, such as information extraction and deep
understanding, a more accurate analysis may be
required. 10
8 See (Strzalkowski, 1990) for Prolog implementation details.
9 Giving the parser more time per sentence doesn't always mean that a better (more accurate) parse will be obtained. For complex or extra-grammatical structures we are likely to be better off if we do not allow the parser to wander around for too long: the most likely interpretation of an unexpected input is probably the one generated early (the grammar rule ordering enforces some preferences).
10 A qualitative method for parser evaluation has been proposed in (Harrison et al., 1991), and it may be used to make a relative comparison of parser's accuracy. What is not clear is how accurate a parser needs to be for any particular application.
Initially, a full analysis of each sentence is
attempted. If a parse is not returned before the allotted
time elapses, the parser enters the time-out mode.
From this point on, the parser is permitted to skip portions
of input to reach a starter terminal for the next
constituent to be parsed, and to close the currently open
one (or ones) with whatever partial representation has
been generated thus far. The result is an approximate
partial parse, which shows the overall structure of the
sentence, from which some of the constituents may be
missing. The fragments skipped in the first pass are
not thrown out; instead, they are analyzed by a simple
phrasal post-processor that looks for noun phrases and
relative clauses and then attaches the recovered
material to the main parse structure.
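The skipping step of the time-out mode can be pictured as a scan over the remaining tagged words. The following is a minimal sketch, not the TTP implementation (which is written in Prolog): the function name, the (word, tag) list representation, and the tags assigned to the example words are our assumptions.

```python
def skip_to_starter(tagged, pos, starters):
    """Skip from position pos to the nearest word whose tag is a starter.

    Returns (skipped_words, new_pos), or None when no starter tag occurs
    in the rest of the input. The skipped words are kept so that a second
    pass can analyze them separately.
    """
    for j in range(pos, len(tagged)):
        if tagged[j][1] in starters:
            return [word for word, _ in tagged[pos:j]], j
    return None


# Remainder of the example sentence from Section 2, hand-tagged for illustration.
rest = [("a", "dt"), ("theorem", "nn"), ("induced", "vbn"), ("by", "in"),
        ("those", "dt"), ("specifications", "nns"), ("is", "vbz"),
        ("proved", "vbn"), (",", "com"), ("and", "cc"), ("the", "dt"),
        ("desired", "jj"), ("program", "nn")]

skipped, pos = skip_to_starter(rest, 0, {"cc"})
print(rest[pos][0], len(skipped))  # and 9
```

In this toy run the nine words before the conjunction are set aside as the skipped fragment, and parsing would resume at and.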
The time-out mechanism is implemented using
straightforward parameter passing and is at present
limited to only a subset of nonterminals used by the
grammar. Suppose that X is such a nonterminal, and
that it appears on the right-hand side of a production S
-> X Y Z. The set of "starters" is computed for Y,
which consists of the word tags that can occur as the
left-most constituent of Y. This set is passed as a
parameter while the parser attempts to recognize X in
the input. If X is recognized successfully within a
preset time, then the parser proceeds to parse a Y, and
nothing else happens. On the other hand, if the parser
cannot determine whether there is an X in the input or
not, that is, it neither succeeds nor fails in parsing X
before being timed out, the unfinished X constituent is
closed with a partial parse, and the parser is restarted
at the closest element from the starters set for Y that
can be found in the remainder of the input. If Y
rewrites to an empty string, the starters for Z to the
right of Y are added to the starters for Y and both sets
are passed as a parameter to X. As an example consider
the following clauses in the TTP parser: 11
sentence(P) :- assertion([],P).
assertion(SR,P) :-
    clause(SR,P1), s_coord(SR,P1,P).
clause(SR,P) :-
    % starters for sa: tags that can begin a subject
    sa([pdt,dt,cd,pp,ppS,jj,jjr,jjs,nn,nns,np,nps],PA1),
    % starters for subject: finite verb forms
    subject([vbd,vbz,vbp],Tail,P1),
    verbphrase(SR,Tail,P1,PA1,P),
    subtail(Tail).
thats(SR,P) :-
    that, assertion(SR,P).
11 The clauses are slightly simplified, and some arguments are removed for expository reasons.
In the clause production above, a (finite) clause
rewrites into an (optional) sentence adjunct (SA), a
subject, a verbphrase and subject's right adjunct 
(SUBTAIL, also optional). With the exception of sub- 
tail, each predicate has a parameter that specifies the 
list of "starter" tags for restarting the parser, should the 
evaluation of this predicate exceed the allotted portion 
of time. Thus, in case sa is aborted before its evaluation
is complete, the parser will jump over some elements
of the unparsed portion of the input looking for
a word that could begin a subject phrase (either a pre- 
determiner, a determiner, a count word, a pronoun, an 
adjective, a noun, or a proper name). Likewise, when 
subject is timed out, the parser will restart with 
verbphrase at either vbz, vbd or vbp (finite forms of a 
verb). Note that if verbphrase is timed out, then subtail 
will be ignored, both verbphrase and clause will be 
closed, and the parser will restart at an element of set 
SR passed down to clause from assertion. Note also 
that in the top-level production for a sentence the star- 
ter set for assertion is initialized to be empty: if the 
failure occurs at this level, no continuation is possible. 
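The way starter sets are assembled for a production such as S -> X Y Z can be sketched as follows. This is an illustration under our own assumptions, not the parser's code: the table FIRST of left-most tags, the set of nonterminals that may rewrite to the empty string, and the fallback to the set inherited from the parent (mirroring how verbphrase receives SR in the clause production) are all supplied by hand here.

```python
def starter_set(rhs, i, first, nullable, inherited=()):
    """Starter tags to pass to rhs[i]: tags that can begin what follows it.

    Starters of each following symbol are collected until a symbol that
    cannot rewrite to the empty string is reached. If everything to the
    right may be empty, the set inherited from the parent is added.
    """
    starters = []

    def add(tags):
        for tag in tags:
            if tag not in starters:
                starters.append(tag)

    for sym in rhs[i + 1:]:
        add(first[sym])
        if sym not in nullable:
            return starters
    add(inherited)
    return starters


FIRST = {"Y": ["rb", "in"], "Z": ["vbz", "vbd", "vbp"]}  # hypothetical tables
RHS = ["X", "Y", "Z"]

print(starter_set(RHS, 0, FIRST, nullable=set()))    # ['rb', 'in']
print(starter_set(RHS, 0, FIRST, nullable={"Y"}))    # ['rb', 'in', 'vbz', 'vbd', 'vbp']
print(starter_set(RHS, 2, FIRST, nullable=set(), inherited=["cc"]))  # ['cc']
```

The second call corresponds to the case described above in which Y may rewrite to an empty string, so the starters for Z are added to those for Y.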
When a non-terminal is timed out and the parser 
jumps over a non-zero length fragment of input, it is 
assumed that the skipped part was some sub- 
constituent of the closed non-terminal. Accordingly, a 
place holder is left in the parse structure under the 
node dominated by this non-terminal, which will be 
later filled by some nominal material recovered from 
the fragment. The examples given in the Appendix 
show approximate parse structures generated by TTP.
There are a few caveats in the skip-and-fit pars- 
ing strategy just outlined which warrant further expla- 
nation. In particular, the following problems must be 
resolved to assure the parser's effectiveness: how to select
starter tags for non-terminals, how to select
non-terminals at which to place the starter tags, and finally
how to select non-terminals at which input skipping
can occur.
Obviously, some tags are more likely to occur at
the left-most position of a constituent than others.
Theoretically, a subject phrase can start with a word
tagged with any element from the following list: pdt,
dt, cd, jj, jjr, jjs, pp, ppS, nn, nns, np, nps, vbg, vbn, rb,
in. 12 In practice, however, we may select only a subset
of these, as shown in the clause production above.
Although we now risk missing the left-hand boundary
of subject phrases in some sentences, while skipping
an adjunct to their left, most cases are still covered and
the chances of making a serious misinterpretation of
input are significantly lower.
12 This list is not complete. In addition to the tags explained before: pdt - predeterminer, jjr - comparative adjective, jjs - superlative adjective, pp - pronoun, ppS - genitive, np, nps - proper noun, rb - adverb.
We also need to decide on how input skipping is
to be done. In the most straightforward design, when a
nonterminal X is timed out, the parser would skip
input until it has reached a starter element of a nonterminal
Y adjacent to X from the right, according to the
top-down predictions. 13 On the other hand, certain
adjunct phrases may be of little interest, possibly
because of their typically low information content,
and we may choose to ignore them altogether. Therefore,
if X is timed out, and Y is a low-content adjunct
phrase, we can make the parser jump right to the
next nonterminal Z. In the clause production discussed
before, subtail is skipped over if verbphrase is timed
out. 14
Finally, it is not an entirely trivial task to select 
non-terminals at which the input skipping can occur. If 
wrong non-terminals are chosen the parser may gen- 
erate rather uninteresting structures that would be next 
to useless, or it may become trapped in inadvertently 
created dead ends, hopelessly trying to fit the parse. 
Consider, for example, the following sentence, taken 
from MUC-3 corpus of news messages: 
HONDURAN NATIONAL POLICE ON MONDAY
PRESENTED TO THE PRESS HONDURAN
JUAN BAUTISTA NUNEZ AMADOR
AND NICARAGUAN LUIS FERNANDO ORDONEZ
REYES, WHO TOLD REPORTERS
THAT COMMANDER AURELIANO WAS ASSASSINATED
ON ORDERS FROM JOSE DE
JESUS PENA, THE NICARAGUAN EMBASSY
CHIEF OF SECURITY.
After reaching the verb PRESENTED, the parser consults
the lexicon and finds that one of the possible subcategorizations
of this verb is [pun,to], that is, its
object string can be a prepositional phrase with 'to'
followed by a noun phrase. The parser thus begins to
look for a prepositional phrase starting at "TO THE
PRESS ...", but unfortunately misses the end of the
phrase at PRESS (the following word is tagged as a
noun), and continues until reaching the end of sentence.
At this point it realizes that it went too far (there
is no noun phrase left), and starts backing up. Before
the parser has a chance to back up to the word PRESS
and correct the early mistake, however, the time-out
mode is turned on, and instead of abandoning the
current analysis, the parser now tries hard to fix it by
skipping varying portions of input. This may take a
considerable amount of time if the skip points are badly
placed. On the other hand, we wouldn't like to allow
an easy exit by accepting an empty noun phrase at the
end of the sentence! 15
13 Note that the top-down predictions are crucial for the skipping parser, whether the parser's processing is top-down or bottom-up.
14 subtail is the remainder of a discontinued subject phrase.
One of the essential properties of the input skipping
mechanism is its flexibility to jump over
varying-size chunks of the input string. The goal is to
fit the input with a closest matching parse structure
while leaving the minimum number of words unaccounted
for. In TTP, the skipping mechanism is implemented
by adding extra productions for selected nonterminals,
and these are always tried first whenever the
nonterminal is to be expanded. We illustrate this with
rn productions covering right adjuncts to a noun.
rn(SR,P) :-
    timed_out, !,
    skip(SR), store(P).
rn(_,[]) :-
    la([[pdt,dt,vbz,vbp,vbd,
         md,com,ha,rmr]]),
    \+ is([[com],[wdt,wp,wps]]).
rn(SR,P) :- rn1(SR,P).
In the rn predicate, SR is the list of starter tags and P is
the parse tree fragment. The first production checks if
the time-out mode has already been entered, in which
case the input is skipped until a starter tag is found,
while the skipped words are stored into P to be
analyzed later in the parser's second pass. Note that in
this case all other rn productions are cut off; however,
should the first skip-and-fit attempt fail to lead to a
successful parse, backtracking may eventually force
predicate skip(SR) to evaluate again and make a longer
leap. In a top-down left to right parser, each input
skipping location becomes potentially a multiple backtracking
point which needs to be controlled in order to
avoid a combinatorial explosion of possibilities. This
is accomplished by supplementing top-down predictions
with bottom-up, data-driven fragmentation of
input, and a limited lookahead. For example, in the
second of the rn productions above, a right adjunct to a
noun can be considered empty if the item following
the noun is either a period, a semicolon, a comma, or a
word tagged as pdt, dt, vbz, vbp, vbd, or md, but not a
comma followed by a relative pronoun. 16
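The lookahead test in the second rn production can be restated outside Prolog. The sketch below follows the prose description above rather than the exact tag list in the code; the tag name "semi" for a semicolon is hypothetical, as is the function name, while "per" and "com" are the period and comma tags used in the tagged examples.

```python
# Tags after which a noun's right adjunct may be taken to be empty.
# "semi" (semicolon) is a placeholder name chosen for this sketch.
EMPTY_RN_LOOKAHEAD = {"pdt", "dt", "vbz", "vbp", "vbd", "md",
                      "com", "semi", "per"}
RELATIVE_PRONOUNS = {"wdt", "wp", "wps"}


def rn_may_be_empty(tags, i):
    """True if the right adjunct of a noun ending just before tags[i] can be empty."""
    if i >= len(tags) or tags[i] not in EMPTY_RN_LOOKAHEAD:
        return False
    # A comma followed by a relative pronoun announces a relative clause,
    # so the adjunct must not be assumed empty.
    if tags[i] == "com" and i + 1 < len(tags) and tags[i + 1] in RELATIVE_PRONOUNS:
        return False
    return True


print(rn_may_be_empty(["dt", "nn", "vbz"], 2))    # True: a finite verb follows the noun
print(rn_may_be_empty(["nn", "com", "wdt"], 1))   # False: comma plus relative pronoun
print(rn_may_be_empty(["nn", "in", "dt"], 1))     # False: a preposition may start an adjunct
```

A cheap check of this kind lets the parser avoid opening a right adjunct, and with it a potential skipping point, where the next tag already rules one out.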
15 In the present implementation, when the skipping mode is entered, it will stay on for the balance of the first pass in parsing of the current sentence. This way, one skip-and-fit attempt may lead to another before any backtracking is considered. An alternative is to do time-out on a nonterminal by nonterminal basis, that is, to time out processing of selected nonterminals only and then resume regular parsing. This design leads to a far more complex implementation and somewhat inferior performance, but it might be worth considering in the future.
16 md - modal verb; vbp - plural verb; wdt, wp, wps - relative pronouns.
5. ROBUSTNESS 
TTP is a robust parser and it will process nearly
every sentence or phrase, provided the latter is reasonably
correctly tagged. 17 The parser's robustness is
further increased by allowing for a gradual degradation
of its performance rather than an outright failure
in the face of an unexpected input. Analysis of each
sentence or phrase is attempted in up to four ways:
(1) as a sentence, (2) as a noun phrase or a preposition
phrase with a right adjunct(s), (3) as a gerundive
clause, and if all these fail, (4) as a series of simple
noun phrases, with each of these attempts allotted a
fresh time slice. 18 The purpose of this extension is to
accommodate some infrequent but still important constructions,
such as titles, itemizations, and lists.
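The fallback order just described amounts to a simple cascade. The sketch below shows only the control flow; the four stand-in analyses use crude tag tests invented for this example and bear no relation to the actual grammar, and the time slot is passed along without being enforced.

```python
def analyze(tagged, analyses, time_slot):
    """Try each analysis in order, giving each one a fresh time slice."""
    for name, parse in analyses:
        result = parse(tagged, time_slot)
        if result is not None:
            return name, result
    return None


# Hypothetical stand-ins for the four analyses listed above.
def as_sentence(tagged, time_slot):
    finite = any(tag in {"vbz", "vbp", "vbd"} for _, tag in tagged)
    return ["sentence", tagged] if finite else None


def as_np_or_pp(tagged, time_slot):
    ok = bool(tagged) and tagged[0][1] in {"dt", "jj", "nn", "nns", "np", "in"}
    return ["np_or_pp", tagged] if ok else None


def as_gerundive(tagged, time_slot):
    ok = bool(tagged) and tagged[0][1] == "vbg"
    return ["gerundive", tagged] if ok else None


def as_np_series(tagged, time_slot):
    return ["np_series", tagged]


ANALYSES = [("sentence", as_sentence), ("np_or_pp", as_np_or_pp),
            ("gerundive", as_gerundive), ("np_series", as_np_series)]

title = [("parsing", "vbg"), ("with", "in"), ("tagged", "vbn"), ("text", "nn")]
print(analyze(title, ANALYSES, 0.5)[0])  # gerundive
```

Since every attempt receives its own slice, the worst case for a single input is four times the time slot, which is the situation footnote 18 refers to.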
6. DISCUSSION 
In this paper we described TTP, a fast and
robust parser for natural language. In the experiments
conducted with various text collections of more than 50
million words the average parsing speed recorded was
approx. 0.5 sec/sentence. For example, the total time
spent on parsing the CACM-3204 collection was less
than 1.5 hours. In other words, TTP can process
100,000 words in approximately 45 minutes, and it
could parse a gigabyte of text (approx. 150 million
words) in about 40 days, on a 21 MIPS computer.
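The figures quoted in this section can be related to one another by converting each to words per second. This is a minimal sketch that uses only numbers stated in the text; the helper name is ours.

```python
def words_per_second(words: int, seconds: float) -> float:
    """Average throughput implied by a word count and an elapsed time."""
    return words / seconds


rate_100k = words_per_second(100_000, 45 * 60)               # 100,000 words in 45 minutes
rate_gb = words_per_second(150_000_000, 40 * 24 * 60 * 60)   # 150 million words in 40 days
print(f"{rate_100k:.1f} and {rate_gb:.1f} words per second")  # 37.0 and 43.4 words per second
```

Both values are of the same order as the 44 words per second quoted in the abstract; the differences are what one would expect from the rounding of the quoted times.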
The parser is based on a wide coverage grammar 
for English, and it contains a powerful skip-and-fit 
recovery mechanism that allows it to deal with unex- 
pected input and to perform effectively under a severe 
time pressure. Prior to parsing, the input text is tagged 
with a stochastic tagger that assigns part-of-speech 
labels to every word, thus resolving lexical level ambi- 
guity. 
TTP has been used as a front-end of a natural
language processing component to a traditional
document-based information retrieval system (Strzalkowski
and Vauthey, 1992). The parse structures were
further analyzed to extract word and phrase dependency
relations which were in turn used as input to
various statistical and indexing processes. The results
obtained were generally satisfactory: an improvement
in both recall and precision of document retrieval has
been observed. At present, we are also conducting
experiments with large corpora of technical computer
science texts in order to extract domain-specific
conceptual taxonomies for an even greater gain in
retrieval effectiveness.
17 Some sentences (1 in 5000) may still fail to parse if tagging errors are compounded in an unexpected way.
18 Although parsing of some sentences may now approach four times the allotted time limit, we noted that the average parsing time per sentence at 0.745 sec. is only slightly above the time-out limit.
7. ACKNOWLEDGEMENTS 
We wish to thank Ralph Weischedel and Heidi 
Fox of BBN for assisting in the use of the part of 
speech tagger. ACM has generously provided us with 
the Computer Library text database. This paper is 
based upon work supported by the Defense Advanced 
Research Project Agency under Contract N00014-90- 
J-1851 from the Office of Naval Research, the 
National Science Foundation under Grant IRI-89- 
02304, and by the Canadian Institute for Robotics and 
Intelligent Systems (IRIS). 
APPENDIX: Sample parses 
A few examples of non-standard output generated
by TTP are shown in Figures 1 to 3. In Figure 1,
TTP has failed to find the main verb and it had to jump
over much of the last phrase such as the LR(k) grammars,
partly due to an improper tokenization of LR(k)
(note skipped nodes indicating the material ignored in
the first pass). In Figure 2, the parser has initially
assumed that the conjunction in the sentence has the
narrow scope, then it realized that something went
wrong but, apparently, there was no time left to back
up. Note, however, that little has been lost: a complete
structure of the second half of this sentence following
the conjunction and is easily recovered from the parse
tree (var points up to the dominating np). Occasionally,
sentences may come out substantially truncated,
as shown in Figure 3, where although has been mistagged
as a preposition.
SENTENCE:
The problem of determining whether an arbitrary
context-free grammar is a member of some easily
parsed subclass of grammars such as the LR(k)
grammars is considered.
APPROXIMATE PARSE:
[[verb,[]],[subject,[np,[n,problem],[t_pos,the],
[of,[[verb,[determine]],[subject,anyone],
[object,[[verb,[be]],
[subject,[np,[n,grammar],[t_pos,an],
[adj,[arbitrary]],[adj,[context_free]]]],
[object,[np,[n,member],[t_pos,a],
[of,[np,[n,subclass],[t_pos,some],
[a_pos_v,
[[verb,[parse,[adv,easily]]],
[subject,anyone],
[object,pro]]],
[of,[np,[n,grammar],
[rn_wh,[[verb,[such]],
[subject,var]]]]]]]]]]]],
[sub_ord,[as,[[verb,[be]],[subject,pro],
[object,[np,[n,k],[t_pos,the],
[adj,["lrC]]]]]]],
[skipped,[[np,[n,grammar]]]]]]]],
[skipped,[[is],[wh_rel,[[verb,[consider]],
[subject,anyone],[object,var]]]]]].
Figure 1. 
SENTENCE: 
The TX-2 computer at MIT Lincoln Laboratory was used 
for the implementation of such a system and the 
characteristics of this implementation are reported. 
APPROXIMATE PARSE: 
[[be],[[verb,[use]],
[subject,anyone],
[object,[np,[n,computer],[t_pos,the],[adj,[tx_2]]]],
[for,[and,
[np,[n,implementation],[t_pos,the],
[of,[np,[n,system],[t_pos,[such,a]]]]],
[np,[n,characteristics],[t_pos,the],
[of,[np,[n,implementation],[t_pos,this]]],
[skipped,
[[are],[wh_rel,[[verb,[report]],
[subject,anyone],
[object,var]]]]]]]]],
[at,[np,[n,laboratory],[adj,[mit]],
[n_pos,[np,[n,lincoln]]]]]].
Figure 2. 
SENTENCE: 
In principle, the system can deal with any ortho- 
graphy, although at present it is limited to 4000 
Chinese characters and some mathematical symbols. 
APPROXIMATE PARSE:
[[can_aux],[[verb,[deal]],
[subject,[np,[n,system],[t_pos,the]]],
[sub_ord,[with,[[verb,[limit]],
[subject,anyone],
[object,[skipped,
[[np,[n,orthography],[t_pos,any]],
[comS,although,at],
[np,[n,present]],
[np,[n,it]],
[is]]]],
[to,[np,[n,character],
[count,[4000]],
[a_pos,[chinese]],
[skipped,
[[and],
[np,[n,symbol],[t_pos,some],
[adj,[mathematical]]]]]]]]]]],
[in,[np,[n,principle]]]].
Figure 3. 

REFERENCES 

Church, Kenneth Ward. 1988. "A Stochastic Parts 
Program and Noun Phrase Parser for Unrestricted 
Text." Proceedings of the Second Conference 
on Applied Natural Language Processing, pp. 
136-143. 

Church, Kenneth Ward and Patrick Hanks. 1990. 
"Word association norms, mutual information, 
and lexicography." Computational Linguistics, 
16(1), MIT Press, pp. 22-29. 

De Marcken, Carl G. 1990. "Parsing the LOB 
corpus." Proceedings of the 28th Meeting of the 
ACL, Pittsburgh, PA. pp. 243-251. 

Harrison, Philip, et al. 1991. "Evaluating Syntax Per- 
formance of Parser/Grammars of English." 
Natural Language Processing Systems Evaluation
Workshop, Berkeley, CA. pp. 71-78. 

Hindle, Donald. 1983. "User manual of Fidditch, a 
deterministic parser." Naval Research Labora- 
tory Technical Memorandum 7590-142. 

Hindle, Donald and Mats Rooth. 1991. "Structural 
Ambiguity and Lexical Relations." Proceedings 
of the 29th Meeting of the ACL, Berkeley, CA. 
pp. 229-236. 

Jensen, K., G.E. Heidorn, L.A. Miller, and Y. Ravin. 
1983. "Parse fitting and prose fixing: Getting a 
hold of ill-formedness." Computational Linguistics,
9(3-4), pp. 147-161.

Meteer, Marie, Richard Schwartz, and Ralph 
Weischedel. 1991. "Studies in Part of Speech 
Labelling." Proceedings of the 4th DARPA 
Speech and Natural Language Workshop, 
Morgan Kaufmann, San Mateo, CA. pp. 331-336.

Ruge, Gerda, Christoph Schwarz, Amy J. Warner. 
1991. "Effectiveness and Efficiency in Natural 
Language Processing for Large Amounts of 
Text." Journal of the ASIS, 42(6), pp. 450-456. 

Sager, Naomi. 1981. Natural Language Information
Processing. Addison-Wesley. 

Strzalkowski, Tomek. 1990. "Reversible logic gram- 
mars for natural language parsing and 
generation." Computational Intelligence, 6(3),
NRC Canada, pp. 145-171. 

Strzalkowski, Tomek and Barbara Vauthey. 1992. 
"Information Retrieval Using Robust Natural 
Language Processing." Proceedings of the 30th 
Annual Meeting of the ACL, Newark, Delaware, 
June 28 - July 2. 
