<?xml version="1.0" standalone="yes"?> <Paper uid="P81-1001"> <Title>A Practical Comparison of Parsing Strategies</Title> <Section position="2" start_page="0" end_page="2" type="metho"> <SectionTitle> THE SRI EXPERIMENTS </SectionTitle> <Paragraph position="0"> In this section we report the experiments conducted at SRI. First, the parsers and their strategy variations are described and intuitively compared; second, the grammars are described in terms of their purpose and their coverage; third, the sentences employed in the comparisons are discussed with regard to their source and presumed generality; next, the methods of comparing performance are detailed; then the results of the major experiment are presented. Finally, three small follow-up experiments are reported as anecdotal evidence.</Paragraph> <Paragraph position="1"> The Parsers and Strategies One of the parsers employed in the SRI experiments was LIFER: a top-down, depth-first parser with automatic back-up \[Hendrix, 1977\]. LIFER employs special &quot;look down&quot; logic based on the current word in the sentence to eliminate obviously fruitless downward expansion when the current word cannot be accepted as the leftmost element in any expansion of the currently proposed syntactic category \[Griffiths and Petrick, 1965\], and a &quot;well-formed substring table&quot; \[Woods, 1975\] to eliminate redundant pursuit of paths after back-up. LIFER supports a traditional style of rule writing where phrase-structure rules are augmented by (LISP) procedures which can reject the application of the rule when proposed by the parser, and which construct an interpretation of the phrase when the rule's application is acceptable. The special user-definable routine responsible for evaluating the S-level rule-body procedures was modified to collect certain statistics but reject an otherwise acceptable interpretation; this forced LIFER into its back-up mode, where it sought out an alternate interpretation, which was recorded and rejected in the same fashion. In this way LIFER proceeded to derive all possible interpretations of each sentence according to the grammar. This rejection behavior was not entirely unusual, in that LIFER specifically provides for such an eventuality, and because the grammars themselves were already making use of this facility to reject faulty interpretations. By forcing LIFER to compute all interpretations in this natural manner, it could meaningfully be compared with the other parsers.</Paragraph> <Paragraph position="2"> The second parser employed in the SRI experiments was DIAMOND: an all-paths bottom-up parser \[Paxton, 1977\] developed at SRI as an outgrowth of the SRI Speech Understanding Project \[Walker, 1978\]. The basis of the implementation was the Cocke-Kasami-Younger algorithm \[Aho and Ullman, 1972\], augmented by an &quot;oracle&quot; \[Pratt, 1975\] to restrict the number of syntax rules considered.</Paragraph> <Paragraph position="3"> DIAMOND is used during the primarily syntactic, bottom-up phase of analysis; subsequent analysis phases work top-down through the parse tree, computing more detailed semantic information, but these do not involve DIAMOND per se. DIAMOND also supports a style of rules wherein the grammar is augmented by LISP procedures to either reject rule application, or compute an interpretation of the phrase.</Paragraph> <Paragraph position="4"> The third parser used in the SRI experiments is dubbed CKY. It too is an implementation of the Cocke-Kasami-Younger algorithm.
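To make the core algorithm concrete, here is a minimal sketch of a Cocke-Kasami-Younger recognizer in Python. It is purely illustrative: the parsers compared above were LISP programs that also built interpretations, and the grammar encoding below (a Chomsky-normal-form grammar as two dictionaries) is an assumption for expository purposes.

```python
# Minimal CKY recognizer sketch (illustrative only).
# `unary` maps a word to the categories that can cover it;
# `binary` maps a pair of categories (B, C) to every A with a rule A -> B C.

def cky_recognize(words, unary, binary, start="S"):
    n = len(words)
    # chart[i][j] holds the categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(unary.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]

# Toy grammar (hypothetical categories) and a familiar sentence:
unary = {"the": {"Det"}, "old": {"Adj"}, "man": {"N"}, "ate": {"V"},
         "fish": {"N", "NP"}}
binary = {("Det", "N"): {"NP"}, ("Det", "Nom"): {"NP"},
          ("Adj", "N"): {"Nom"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky_recognize("the old man ate fish".split(), unary, binary))  # True
```

The triple loop over span lengths, start positions, and split points is the entire recognizer, which is why implementations of this algorithm can be strikingly compact.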
Shortly after the main experiment it was augmented by &quot;top-down filtering,&quot; and some small-scale tests were conducted. Like Pratt's oracle, top-down filtering rejects the application of certain rules discovered by the bottom-up parser -- specifically, those that a top-down parser would not discover. For example, assuming a grammar for English in a traditional style, and the sentence &quot;The old man ate fish,&quot; an ordinary bottom-up parser will propose three S phrases, one each for: &quot;man ate fish,&quot; &quot;old man ate fish,&quot; and &quot;The old man ate fish.&quot; In isolation each is a possible sentence. But a top-down parser will normally propose only the last string as a sentence, since the left contexts &quot;The old&quot; and &quot;The&quot; prohibit the sentence reading for the remaining strings. Top-down filtering, then, is like running a top-down parser in parallel with a bottom-up parser. The bottom-up parser (being faster at discovering potential rules) proposes the rules, and the top-down parser (being more sensitive to context) passes judgement. Rejects are discarded immediately; those that pass muster are considered further, for example being submitted for feature checking and/or semantic interpretation.</Paragraph>
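One way such a filter can be realized -- a sketch under assumptions, since neither Pratt's oracle nor the filter added to CKY is specified at this level of detail -- is to maintain, for each string position, the set of categories that some top-down derivation consistent with the left context could expect to begin there, and to discard any bottom-up phrase whose category is not in that set:

```python
# Sketch of a top-down filter for a bottom-up parser (hypothetical names).
# `rules` is a list of (lhs, rhs) phrase-structure rules.

def left_corner_closure(cats, rules):
    """All categories reachable as the leftmost descendant of `cats`,
    i.e. everything a top-down parser could propose first."""
    closure = set(cats)
    frontier = list(cats)
    while frontier:
        c = frontier.pop()
        for lhs, rhs in rules:
            if lhs == c and rhs and rhs[0] not in closure:
                closure.add(rhs[0])
                frontier.append(rhs[0])
    return closure

def admissible(category, i, predicted):
    """The filter proper: a bottom-up phrase of this category starting
    at position i survives only if some top-down derivation predicts it
    there. `predicted` is maintained as parsing proceeds left to right:
    predicted[0] = left_corner_closure({start_symbol}, rules), and
    predicted[j] grows whenever an admitted phrase ending at j leaves
    some rule expecting its next constituent to begin at j."""
    return category in predicted[i]
```

In the example above, the categories predicted at the positions following &quot;The&quot; and &quot;The old&quot; do not include S, so the two spurious sentence readings are rejected before feature checking or semantic interpretation is ever attempted.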
<Paragraph position="5"> An intuitive prediction of practical performance is a somewhat difficult matter. LIFER, while not originally intended to produce all interpretations, does support a reasonably natural mechanism for forcing that style of analysis. A large amount of effort was invested in making LIFER more and more efficient as the LADDER linguistic component grew and began to consume more space and time. In CPU time its speed was increased by a factor of at least twenty with respect to its original, and rather efficient, implementation. One might therefore expect LIFER to compare favorably with the other parsers, particularly when interpreting the LADDER grammar written with LIFER, and only LIFER, in mind. DIAMOND, while implementing the very efficient Cocke-Kasami-Younger algorithm and being augmented with an oracle and special programming tricks (e.g., assembly code) intended to enhance its performance, is a rather massive program and might be considered suspect for that reason alone; on the other hand, its predecessor was developed for the purpose of speech understanding, where efficiency issues predominate, and this strongly argues for good performance expectations. Chester's implementation of the Cocke-Kasami-Younger algorithm represents the opposite extreme of startling simplicity. His central algorithm is expressed in a dozen lines of LISP code and requires little else in a basic implementation. Expectations here might be bi-modal: it should either perform well due to its concise nature, or poorly due to the lack of any efficiency aids. There is one further consideration of merit: that of inter-programmer variability. Both LIFER and Chester's parser were rewritten for increased efficiency by the author; DIAMOND was used without modification. Thus differences between DIAMOND and the others might be due to different programming styles -- indeed, between DIAMOND and CKY this represents the only difference aside from the oracle -- while differences between LIFER and CKY should reflect real performance distinctions, because the same programmer (re)implemented them both.</Paragraph> <Paragraph position="7"> The Grammars The &quot;semantic grammar&quot; employed in the SRI experiments had been developed for the specific purpose of answering questions posed in English about the domain of ships at sea \[Sacerdoti, 1977\]. There was no pretense of its being a general grammar of English; nor was it adept at interpreting questions posed by users unfamiliar with the naval domain. That is, the grammar was attuned to questions posed by knowledgeable users, answerable from the available database. The syntactic categories were labelled with semantically meaningful names like <SHIP>, <ARRIVE>, <PORT>, and the like, and the words and phrases encompassed by such categories were restricted in the obvious fashion. Its adequacy of coverage is suggested by the success of LADDER as a demonstration vehicle for natural language access to databases \[Hendrix et al., 1978\].</Paragraph> <Paragraph position="8"> The linguistic grammar employed in the SRI experiments came from an entirely different project concerned with discourse understanding \[Grosz, 1978\]. In the project scenario a human apprentice technician consults with a computer which is expert at the disassembly, repair, and reassembly of mechanical devices such as a pump. The computer guides the apprentice through the task, issuing instructions and explanations at whatever levels of detail are required; it may answer questions, describe appropriate tools for specific tasks, etc. The grammar used to interpret these interactions was strongly linguistically motivated \[Robinson, 1980\]. Developed in a domain primarily composed of declarative and imperative sentences, its generality is suggested by the short time (a few weeks) required to extend its coverage to the wide range of questions encountered in the LADDER domain.</Paragraph> <Paragraph position="9"> In order to prime the various parsers with the different grammars, four programs were written to transform each grammar into the formalism expected by the two parsers for which it was not originally written. Specifically, the linguistic grammar had to be reformatted for input to LIFER and CKY; the semantic grammar, for input to CKY and DIAMOND. Once each of six systems was loaded with one parser and one grammar, the stage would be set for the experiment.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> The Sentences </SectionTitle> <Paragraph position="0"> Since LADDER's semantic grammar had been written for sentences in a limited domain, and was not intended for general English, it was not possible to test that grammar on any corpus outside of its domain. Therefore, all sentences in the experiment were drawn from the LADDER benchmark: the broad collection of queries designed to verify the overall integrity of the LADDER system after extensions had been incorporated. These sentences, almost all of them questions, had been carefully selected to exercise most of LADDER's linguistic and database capabilities.
Each of the six systems, then, was to be applied to the analysis of the same 249 benchmark sentences; these ranged in length from 2 to 23 words and averaged 7.82 words.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Methods of Comparison </SectionTitle> <Paragraph position="0"> Software instrumentation was used to measure the following: the CPU time; the number of phrases (instantiations of grammar rules) proposed by the parser; the number of these rejected by the rule-body procedures in the usual fashion; and the storage requirements (number of CONSes) of the analysis attempt. Each of these was recorded separately for sentences which were parsed vs. not parsed, and in the former case the number of interpretations was recorded as well. For the experiment, the database access code was short-circuited; thus only analysis, not question answering, was performed. The collected data was categorized by sentence length and treatment (parser and grammar) for analysis purposes.</Paragraph> <Paragraph position="1"> Summary of the First Experiment The first experiment involved the production of six different instrumented systems -- three parsers, each with two grammars -- and six test runs on the identical set of 249 sentences comprising the LADDER benchmark.</Paragraph> <Paragraph position="2"> The benchmark, established quite independently of the experiment, had as its raison d'etre the vigorous exercise of the LADDER system for the purpose of validating its integrity. The sentences contained therein were intended to constitute a representative sample of what might be expected in that domain. The experiment was conducted on a DEC KL-10; the systems were run separately, during low-load conditions, in order to minimize competition with other programs which could confound the results.</Paragraph> <Paragraph position="3"> The Experimental Results As it turned out, the large internal grammar storage overhead of the DIAMOND parser prohibited its being loaded with the LADDER semantic grammar: the available memory space was exhausted before the grammar could be fully defined. Although eventually a method was worked out whereby the semantic grammar could be loaded into DIAMOND, the resulting system was not tested due to its non-standard mode of operation, and because the working space left over for parsing was minimal. Therefore, the results and discussion will include data for only five combinations of parser and grammar.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Linguistic Grammar </SectionTitle> <Paragraph position="0"> In terms of the number of grammar rules found applicable by the parsers, DIAMOND instantiated the fewest (averaging 58 phrases per sentence); CKY, the most (121); and LIFER fell in between (107). LIFER makes copious use of CONS cells for internal processing purposes, and thus required the most storage (averaging 5294 CONSes per parsed sentence); DIAMOND required the least (1107); CKY fell in between (1628). But in terms of parse time, CKY was by far the best (averaging .386 seconds per sentence, exclusive of garbage collection); DIAMOND was next best (.976); and LIFER was worst (2.22).
The total run time on the SRI-KL machine for the batch jobs interpreting the linguistic grammar (i.e., 'pure' parse time plus all overhead charges such as garbage collection, I/O, swapping and paging) was 12 minutes, 50 seconds for LIFER; 7 minutes, 13 seconds for DIAMOND; and 3 minutes, 15 seconds for CKY. The surprising indication here is that, even though CKY proposed more phrases than its competition, and used more storage than DIAMOND (though less than LIFER), it is the fastest parser. This is true whether considering successful or unsuccessful analysis attempts, using the linguistic grammar.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Semantic Grammar </SectionTitle> <Paragraph position="0"> We will now consider the corresponding data for CKY vs. LIFER using the semantic grammar (remembering that DIAMOND was not testable in this configuration). In terms of the number of phrases per parsed sentence, CKY averaged five times as many as LIFER (151 compared to 29). In terms of storage requirements the two were nearly equal (CKY averaging 1552 CONSes per sentence; LIFER, 1498). But in CPU time, discounting garbage collection, CKY was again significantly faster than LIFER (averaging .286 seconds per sentence compared to .635). The total run time on the SRI-KL machine for the batch jobs interpreting the semantic grammar (i.e., &quot;pure&quot; parse time plus all overhead charges such as garbage collections, I/O, swapping and paging) was 5 minutes, 10 seconds for LIFER, and 2 minutes, 56 seconds for CKY. As with the linguistic grammar, CKY was significantly more efficient, whether considering successful or unsuccessful analysis attempts, while using the same grammar and analyzing the same sentences.</Paragraph> <Paragraph position="1"> Three follow-up mini-experiments were conducted. The number of sentences was relatively small (a few dozen), and the results were not permanently recorded; thus they are reported here as anecdotal evidence. In the first, CKY and LIFER were compared in their natural modes of operation -- that is, with CKY finding all interpretations and LIFER finding the first -- using both grammars but just a few sentences. This was in response to the hypothesis that forcing LIFER to derive all interpretations is necessarily unfair. The results showed that CKY derived all interpretations of the sentences in slightly less time than LIFER found its first.</Paragraph> <Paragraph position="2"> The discovery that DIAMOND appeared to be considerably less efficient than CKY was quite surprising. Since DIAMOND implements the same algorithm, augmented with the phrase-limiting &quot;oracle&quot; and special assembly code for efficiency, one might expect it to be faster than CKY. A second mini-experiment was conducted to test the most likely explanation -- that the overhead of DIAMOND's oracle might be greater than the savings it produced. The results clearly indicated that DIAMOND was yet slower without its oracle.</Paragraph> <Paragraph position="3"> The question then arose as to whether CKY might be yet faster if it too were similarly augmented. A top-down filter modification was soon implemented and another small experiment was conducted. Paradoxically, the effect of filtering in this instance was to degrade performance. The overhead incurred was greater than the observed savings.
This remained a puzzlement, and eventually helped to inspire the LRC experiment.</Paragraph> </Section> </Section> <Section position="3" start_page="2" end_page="3" type="metho"> <SectionTitle> THE LRC EXPERIMENT </SectionTitle> <Paragraph position="0"> In this section we discuss the experiment conducted at the Linguistics Research Center. First, the parsers and their strategy variations are described and intuitively compared; second, the grammar is described in terms of its purpose and its coverage; third, the sentences employed in the comparisons are discussed with regard to their source and presumed generality; next, the methods of comparing performance are discussed; finally, the results are presented.</Paragraph> <Paragraph position="1"> The Parsers and Strategies One of the parsers employed in the LRC experiment was the CKY parser. The other parser employed in the LRC experiment is a left-corner parser, inspired again by Chester \[1980\] but programmed from scratch by the author. Unlike a Cocke-Kasami-Younger parser, which indexes a syntax rule by its right-most constituent, a left-corner parser indexes a syntax rule by the left-most constituent in its right-hand side. Once the parser has found an instance of the left-corner constituent, the remainder of the rule can be used to predict what may come next. When augmented by top-down filtering, this parser strongly resembles the Earley algorithm \[Earley, 1970\].</Paragraph> <Paragraph position="2"> Since the small-scale experiments with top-down filtering at SRI had revealed conflicting results with respect to DIAMOND and CKY, and since the author's intuition continued to argue for increased efficiency in conjunction with this strategy despite the empirical evidence to the contrary, it was decided to compare the performance of both parsers with and without top-down filtering in a larger, more carefully controlled experiment. Another strategy variation was engendered during the course of work at the LRC, based on the style of grammar rules written by the linguistic staff. This strategy, called &quot;early constituent tests,&quot; is intended to take advantage of the extent of testing of individual constituents in the right-hand-sides of the rules. Normally a parser searches its chart for contiguous phrases in the order specified by the right-hand-side of a rule, then evaluates the rule-body procedures, which might reject the application due to a deficiency in one of the r-h-s constituent phrases. The early constituent test strategy instead calls for the parser to evaluate the portion of the rule-body procedure that tests the first constituent as soon as that constituent is discovered, to determine whether it is acceptable; if so, the parser may proceed to search for the next constituent and similarly evaluate its test. In addition to the potential savings due to earlier rule rejection, another potential benefit arises from ATN-style sharing of individual constituent tests among such rules as pose the same requirements on the same initial sequence of r-h-s constituents. Thus one test could reject many apparently applicable rules at once, early in the search -- a large potential savings when compared with the alternative of discovering all constituents of each rule and separately applying the rule-body procedures, each of which might reject (the same constituent) for the same reason. On the other hand, the overhead of invoking the extra constituent tests and saving the results for eventual passage to the remainder of the rule-body procedure will to some extent offset the gains.</Paragraph>
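To make the two mechanisms concrete, here is a minimal sketch, in Python, of a left-corner chart recognizer with an early-constituent-test hook. It is illustrative only: the rule format, helper names, and agenda discipline are assumptions, top-down filtering is omitted, and the actual LRC parsers were LISP programs producing full analyses rather than bare recognition.

```python
from collections import defaultdict

# Hypothetical rule format: (lhs, rhs, tests), where `tests` maps an RHS
# position to a predicate over the constituent found there. A failed test
# rejects the rule instance as soon as that constituent is discovered.
# Assumes every rule has a nonempty right-hand side.

def left_corner_recognize(words, rules, lexicon, start="S"):
    by_corner = defaultdict(list)       # rules indexed by leftmost RHS symbol
    for rule in rules:
        by_corner[rule[1][0]].append(rule)

    n = len(words)
    complete = set()                    # (i, j, category) phrases found so far
    complete_from = defaultdict(list)   # i -> [(j, category), ...]
    active = defaultdict(list)          # j -> partial rule instances needing
                                        #      their next constituent at j
    agenda = [(i, i + 1, c) for i, w in enumerate(words)
              for c in lexicon.get(w, ())]

    def passes(rule, dot, cat):         # the early constituent test
        test = rule[2].get(dot)
        return test(cat) if test else True

    def advance(rule, dot, origin, end):
        if dot == len(rule[1]):
            agenda.append((origin, end, rule[0]))        # instance complete
            return
        active[end].append((rule, dot, origin))
        for j, cat in list(complete_from[end]):          # phrases already found
            if rule[1][dot] == cat and passes(rule, dot, cat):
                advance(rule, dot + 1, origin, j)

    while agenda:
        i, j, cat = agenda.pop()
        if (i, j, cat) in complete:
            continue
        complete.add((i, j, cat))
        complete_from[i].append((j, cat))
        for rule, dot, origin in list(active[i]):        # extend waiting rules
            if rule[1][dot] == cat and passes(rule, dot, cat):
                advance(rule, dot + 1, origin, j)
        for rule in by_corner[cat]:                      # left-corner prediction
            if passes(rule, 0, cat):
                advance(rule, 1, i, j)

    return (0, n, start) in complete
```

With tests attached to early RHS positions, one failed predicate abandons a rule instance before its remaining constituents are ever sought, which is the intended saving; the offsetting overhead is visible here as the extra predicate calls and the bookkeeping of partially matched rule instances.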
<Paragraph position="3"> It is commonly considered that the Cocke-Kasami-Younger algorithm is generally superior to the left-corner algorithm in practical application; it is also thought that top-down filtering is beneficial. But in addition to intuitions about the performance of the parsers and strategy variations individually, there is the issue of possible interactions between them. Since a significant portion of the sentence analysis effort may be invested in evaluating the rule-body procedures, the author's intuition argued that the best combination could be the left-corner parser augmented by early constituent tests and top-down filtering -- which would seem to maximally reduce the number of such procedures evaluated.</Paragraph> <Paragraph position="4"> The Grammar The grammar employed during the LRC experiment was the German analysis grammar being developed at the LRC for use in Machine Translation \[Lehmann et al., 1981\].</Paragraph> <Paragraph position="5"> Under development for about two years up to the time of the experiment, it had been tested on several moderately large technical corpora \[Slocum, 1980\] totalling about 23,000 words. Although by no means a complete grammar, it was able to account for between 60 and 90 percent of the sentences in the various texts, depending on the incidence of problems such as highly unusual constructs, outright errors, the degree of complexity in syntax and semantics, and on whether the tests were conducted with or without prior experience with the text. The broad range of linguistic phenomena represented by this material far outstrips that encountered in most NLP systems to date. Given the amount of text described by the LRC German grammar, it may be presumed to operate in a fashion reasonably representative of the general grammar for German yet to be written.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> The Sentences </SectionTitle> <Paragraph position="0"> The sentences employed in the LRC experiment were extracted from three different technical texts on which the LRC MT system had been previously tested. Certain grammar and dictionary extensions based on those tests, however, had not yet been incorporated; thus it was known in advance that a significant portion of the sentences might not be analyzed. Three sentences of each length were randomly extracted from each text, where possible; not all sentence lengths were sufficiently represented to allow this in all cases.</Paragraph> <Paragraph position="1"> The 262 sentences ranged in length from 1 to 39 words, averaging 15.6 words each -- twice as long as the sentences employed in the SRI experiments.</Paragraph> </Section>
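The extraction procedure just described amounts to stratified sampling by sentence length; a minimal sketch (hypothetical function, with sentences represented as word strings):

```python
import random
from collections import defaultdict

def sample_by_length(sentences, per_length=3, seed=0):
    """Draw up to `per_length` sentences of each word-length, at random."""
    rng = random.Random(seed)
    by_len = defaultdict(list)
    for s in sentences:
        by_len[len(s.split())].append(s)
    sample = []
    for length, group in sorted(by_len.items()):
        k = min(per_length, len(group))   # 'where possible'
        sample.extend(rng.sample(group, k))
    return sample
```

Applied to each of the three texts in turn, this yields up to nine sentences of each length; gaps in the corpora presumably account for the sample totalling 262 rather than the maximum.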
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> Methods of Comparison </SectionTitle> <Paragraph position="0"> The LRC experiment was intended to reveal more of the underlying reasons for differential parser performance, including strategy interactions; thus it was necessary to instrument the systems much more thoroughly. Data was gathered for 35 variables measuring various aspects of behavior, including general information (13 variables), search space (8 variables), processing time (7 variables), and memory requirements (7 variables).</Paragraph> <Paragraph position="1"> One of the simpler methods measured the amount of time devoted to storage management (garbage collection in INTERLISP) in order to determine a &quot;fair&quot; measure of CPU time by pro-rating the storage management time according to storage used (CONSes executed); simply crediting garbage collection time to the analysis of the sentence immediately at hand, or alternately neglecting it entirely, would not represent a fair distribution of costs. More difficult was the problem of measuring search space. It was not felt that an average branching factor computed for the static grammar would be representative of the search space encountered during the dynamic analysis of sentences. An effort was therefore made to measure the search space actually encountered by the parsers, differentiated into grammar vs. chart search; in the former instance, a further differentiation was based on whether the grammar space was being considered from the bottom-up (discovery) vs. top-down (filter) perspective. Moreover, the time and space involved in analyzing words and idioms and operating the rule-body procedures was separately measured in order to determine the computational effort expended by the parser proper. For the experiment, the translation process was short-circuited; thus only analysis, not transfer and synthesis, was performed.</Paragraph>
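The pro-rating of garbage-collection time described above can be sketched as follows (hypothetical names; the actual instrumentation was INTERLISP code):

```python
def fair_cpu_time(parse_time, conses, total_gc_time, total_conses):
    """Charge a sentence a share of total garbage-collection time in
    proportion to the storage (CONSes) its analysis consumed, rather than
    crediting each collection to whichever sentence happened to trigger it."""
    share = conses / total_conses if total_conses else 0.0
    return parse_time + total_gc_time * share
```

Under this scheme a sentence's cost depends on how much storage it consumed, not on when the collector happened to run.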
<Paragraph position="2"> Summary of the LRC Experiment The LRC experiment involved the production of eight different instrumented systems -- two parsers (left-corner and Cocke-Kasami-Younger), each with all four combinations of two independent strategy variations (top-down filtering and early constituent tests) -- and eight test runs on the identical set of 262 sentences selected pseudo-randomly from three technical texts supplied by the MT project sponsor. The sentences contained therein may reasonably be expected to constitute a nearly-representative sample of text in that domain, and presumably constitute a somewhat less-representative (but by no means trivial) sample of the types of syntactic structures encountered in more general German text. The usual (i.e., complete) analysis procedures for the purpose of subsequent translation were in effect, which included production of a full syntactic and semantic analysis via phrase-structure rules, feature tests and operations, transformations, and case frames. It was known in advance that not all constructions would be handled by the grammar; further, that for some sentences some or all of the parsers would exhaust the available space before achieving an analysis. The latter problem in particular would indicate differential performance characteristics when working with limited memory. One of the parsers, the version of the CKY parser lacking both top-down filtering and early constituent tests, is essentially identical to the CKY parser employed in the SRI experiments. The experiment was conducted on a DEC 2060; the systems were run separately, late at night in order to minimize competition with other programs which could confound the results.</Paragraph> <Paragraph position="3"> The Experimental Results The various parser and strategy combinations were slightly unequal in their ability to analyze (or, alternately, demonstrate the ungrammaticality of) sentences within the available space. Of the three strategy choices (parser, filtering, constituent tests), filtering constituted the most effective discriminant: the four systems with top-down filtering were 4% more likely to find an interpretation than the four without; but most of this difference occurred within the systems employing the left-corner parser, where the likelihood was 10% greater. The likelihood of deriving an interpretation at all is a matter that must be considered when contemplating application on machines with relatively limited address space. The summaries below, however, have been balanced to reflect a situation in which all systems have sufficient space to conclude the analysis effort, so that the comparisons may be drawn on an equal basis.</Paragraph> <Paragraph position="4"> Not surprisingly, the data reveal differences between single strategies and between joint strategies, but the differences are sometimes much larger than one might suppose. Top-down filtering overall reduced the number of phrases by 35%, but when combined with CKY without early constituent tests the difference increased to 46%. In the latter case, top-down filtering increased the overall search space by a factor of 46 -- to well over 300,000 nodes per sentence. For the Left-Corner Parser without early constituent tests, the growth rate is much milder -- an increase in search space of less than a factor of 6 for a 42% reduction in the number of phrases -- but the original (unfiltered) search space was over 3 times as large as that of CKY. CKY overall required 84% fewer CONSes than did LCP (considering the parsers alone); for one matched pair of joint strategies, pure LCP required over twice as much storage as pure CKY.</Paragraph> <Paragraph position="5"> Evaluating the parsers and strategies via CPU time is a tricky business, for one must define and justify what is to be included. A common practice is to exclude almost everything (e.g., the time spent in storage management, paging, evaluating rule-body procedures, building parse trees, etc.). One commonly employed ideal metric is to count the number of trips through the main parser loops.</Paragraph> <Paragraph position="6"> We argue that such practices are indefensible. For instance, the &quot;pure parse times&quot; measured in this experiment differ by a factor of 3.45 in the worst case, but overall run times vary by 46% at most. But the important point is that if one chose the &quot;best&quot; parser on the basis of pure parse time measured in this experiment, one would have the fourth-best overall system; to choose the best overall system, one must settle for the &quot;sixth-best&quot; parser! Employing the loop-counter metric, we can indeed get a perfect prediction of rank-order via pure parse time based on the inner-loop counters; what is more, a formula can be worked out to predict the observed pure parse times given the three loop counters. But such predictions have already been shown to be useless (or worse) in predicting total program runtime.
Thus in measuring performance we prefer to include everything one actually pays for in the real computing world: paging, storage management, building interpretations, etc., as well as parse time.</Paragraph> <Paragraph position="7"> In terms of overall performance, then, top-down filtering in general reduced analysis times by 17% (though it increased pure parse times by 58%); LCP was 7% less time-consuming than CKY; and early constituent tests lost by 15% compared to not performing the tests early.</Paragraph> <Paragraph position="8"> As one would expect, the joint strategy LCP with top-down filtering \[ON\] and Late (i.e., not Early) Constituent Tests \[LCT\] ranked first among the eight systems. However, due to beneficial interactions the joint strategy \[LCP ON ECT\] (which on intuitive grounds we predicted would be most efficient) came in a close second; \[CKY ON LCT\] came in third. The remainder ranked as follows: \[CKY OFF LCT\], \[LCP OFF LCT\], \[CKY ON ECT\], \[CKY OFF ECT\], \[LCP OFF ECT\]. Thus we see that beneficial interaction with ECT is restricted to \[LCP ON\].</Paragraph> <Paragraph position="9"> Two interesting findings are related to sentence length. One, average parse times (however measured) do not exhibit cubic or even polynomial behavior, but instead appear linear. Two, the benefits of top-down filtering are dependent on sentence length; in fact, filtering is detrimental for shorter sentences. Averaging over all other strategies, the break-even point for top-down filtering occurs at about 7 words. (Filtering always increases pure parse time, PPT, because the parser sees it as pure overhead. The benefits are only observable in overall system performance, due primarily to a significant reduction in the time/space spent evaluating rule-body procedures.) With respect to particular strategy combinations, the break-even point comes at about 10 words for \[LCP LCT\], 6 words for \[CKY ECT\], 6 words for \[CKY LCT\], and 7 words for \[LCP ECT\]. The reason for this length dependency becomes rather obvious in retrospect, and suggests why top-down filtering in the SRI follow-up experiment was detrimental: the test sentences were probably too short.</Paragraph> </Section> </Section> </Paper>