<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1027"> <Title>An Efficient Chart-based Algorithm for Partial-Parsing of Unrestricted Texts</Title> <Section position="3" start_page="194" end_page="194" type="metho"> <SectionTitle> 2. Test Results </SectionTitle> <Paragraph position="0"> We put SPARSER and its first large grammar through a substantial test at the end of May 1991. The task was to extract information on people changing jobs. The articles were from the Wall Street Journal, as downloaded off the Dow Jones News service; the example below is a faithful reproduction of what one of those articles looks like as the news service provides them.</Paragraph> <Paragraph position="1"> The test consisted of 203 articles; literally the second half of all articles that the Journal published in February 1991 whose electronic version had the tag &quot;WNEWS&quot;. They included long columns on advertising and law that mentioned a job change incidentally, and some feature articles. About two thirds were from the Journal's &quot;Who's News&quot; column, where the article below is a typical example. It is the first article from the test set.</Paragraph> <Paragraph position="2"> Note that this is now &quot;semantics&quot; in the sense of finding the denotation of a formula (English phrase) in some model, not in the sense of the choice of labels in a &quot;semantic grammar&quot;. For example, when the parser identifies an NP that is labeled as a &quot;company&quot;, that labeling is syntactic and restricts how the NP can be composed into larger phrases. The denotation of that NP, which SPARSER constructs on the basis of its rules of interpretation, is the particular individual company that the NP refers to, or more precisely, the representation of that company in SPARSER's internal data base.</Paragraph> </Section> <Section position="4" start_page="194" end_page="196" type="metho"> <SectionTitle> TIRES (AUP) TX GOODYEAR TIRE & RUBBER Co. (Akron, </SectionTitle> <Paragraph position="0"> Ohio) - George R. Hargreaves, vice president and treasurer of Goodyear, will become president and chief executive officer of the Celeron Corp. unit, a holding company for Goodyear's All American Pipeline. Mr. Hargreaves, 61, will assume the post effective March 1 and will retain his current posts. Robert W. Milk, Celeron's current president and chief executive, as well as an executive vice president for Goodyear, will be on special assignment until he retires April 30.</Paragraph> <Paragraph position="1"> The task was to extract relations (database tuples), such as the one below, that give the action, person or persons affected, the position (title), and the company or subsidiary. In the text, this corresponds to each clause with a relevant verb, and their variants in reduced clauses, conjunctions, relatives, lists, etc. (though by convention it does not include the appositives, since they give current information rather than changes). It also includes redundant instances, such as nominalizations or anaphoric references (the post). There are four instances in this article, of which SPARSER found three, missing the meaning of his current posts because of rule interference with a recently changed definition for current. The example below is the first of those relations.</Paragraph> <Paragraph position="2"> #<edge75 80 Job-event 105
  event: #<event-type become-title>
  title: (#<title &quot;president&quot;> #<title &quot;chief executive officer&quot;>)
  person: #<person Hargreaves, George R.>></Paragraph>
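<Paragraph> To make the shape of such a tuple concrete, the following is a minimal sketch in Python of the fields just described (event type, person, title or titles, and company or subsidiary); the class and field names are illustrative assumptions chosen to mirror the printed edge above, not SPARSER's actual internals. </Paragraph>
    from dataclasses import dataclass, field
    from typing import List, Optional

    # Illustrative record for one extracted job-change relation; the
    # field names are hypothetical, mirroring the printed edge above.
    @dataclass
    class JobEvent:
        event_type: str                   # e.g. "become-title"
        person: str                       # e.g. "Hargreaves, George R."
        titles: List[str] = field(default_factory=list)
        company: Optional[str] = None     # company or subsidiary, if found

    relation = JobEvent(
        event_type="become-title",
        person="Hargreaves, George R.",
        titles=["president", "chief executive officer"],
        company="Celeron Corp.",
    )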
<Paragraph position="3"> Overall, SPARSER found 81% of all the possible job-change relations (597/735). Within the relations that it found, 81.5% had all of their fields filled with the correct values (486/597). The false positive rate was 3% (19/616). Given the limited size of the test set (203 articles, 735 possible relations), a better way to state these results is that the system found 4/5ths of the relations, and that 4/5ths of those were correct in all respects. Most of the deficits in precision were due to failing to find a value for a field, rather than filling it with an incorrect value; the number of relations with an actual mistake in one field was 6% (36/597). Roughly three man-months went into preparing the grammar.3 The development corpus was the articles from December 1990 and from the first half of February.</Paragraph> <Paragraph position="4"> 3 It is not very informative to report that there were 2,092 rewrite rules in the grammar on the day of the test, since this number includes the definition of 40 individual years, the 12 months and their abbreviations, upper and lowercase forms of most of the words, etc. The number also omits the grammar for proper names and for numbers, since these are organized on a quite different basis. To give some idea of its relevant size, we can point out that it supported 12 topic-specific verb subcategorization frames and 31 topic-specific verbs, 25 title heads and 30 title modifiers, and that about 25% of the 244 mistakes counted in the test could be attributed to missing some rewrite rule and 20% to missing vocabulary (8% to the single case hold <position>).</Paragraph> <Paragraph position="5"> Probably an additional two months would have been required to bring the grammar up to full competence on that corpus; entire classes of constructions that were known to be relevant were not implemented at the time of the test, including definite descriptions of titles acting as people (five vice presidents were ...), and conjuncts that did not have objects of identical type directly on each side of the conjunction. The grammar overall does, however, have reasonable competence in definite references and pronouns, participles, relative clauses, and appositives.</Paragraph> <Paragraph position="6"> Its accuracy, especially in such areas as pp-attachment, stems from its use of semantic rather than strictly syntactic terms in its rules. This means that a non-trivial amount of extension is required for each new topic area that a grammar is written for, though much of what is needed will be analogous to what is already in place, and the syntactic base, with its treatment of auxiliaries, determiners, relative pronouns, etc., can be carried forward unchanged.</Paragraph> <Paragraph position="7"> As a program, Sparser is quite robust. In other applications the system has been run continuously without error for more than 30 hours, and it has handled magazine articles more than forty thousand words long.</Paragraph> <Paragraph position="8"> We will now look at Sparser's algorithm for phrase structure parsing. We begin by introducing the rationale for the algorithm by comparing it with other common phrase structure parsing methods.
Then in §4 we look at the tokenizer, the chart, and the phrase structure rules, finally moving to the details of the algorithm and examples of its use.</Paragraph> <Paragraph position="9"> Unfortunately space does not permit more than a passing mention of the other parsing algorithms SPARSER uses in conjunction with phrase structure parsing; a brief précis of these companion parsing techniques can be found in McDonald (1990).</Paragraph> <Paragraph position="10"> 3. Placing the phrase structure algorithm in context</Paragraph> <Paragraph position="11"> Sparser forms its analysis in one pass through the text from left to right (beginning to end). The backbone of the analysis is a set of context free and context sensitive phrase structure rewrite rules of the usual sort. These rules are applied to the text to form edges (parse nodes) over the terminal words and other edges, i.e. their daughter constituents. The final set of maximal, connected edges constitutes the parser's analysis of the text's form and linguistic relations. A parallel set of projected denotations for these edges in the designated domain model constitutes Sparser's analysis of the text's meaning, as briefly sketched in §1.3.</Paragraph> <Section position="1" start_page="195" end_page="196" type="sub_section"> <SectionTitle> 3.1 Standard phrase structure algorithms </SectionTitle> <Paragraph position="0"> Phrase structure parsing can be seen as a kind of search. One looks for the best analysis of the text by searching the space of possible analyses permitted by the grammar to see which one best describes the derivation of the text. To be sure of arriving at the correct analysis, the search must be thorough enough to ensure that no valid analysis is missed. At the same time, the search space should be as small as possible to ensure efficiency.</Paragraph> <Paragraph position="1"> In considering efficiency, we must trade off the simplicity of the control structure against the amount or complexity of the state information that the algorithm calls for. The simplest control algorithm is probably the nested loops of the CKY algorithm (see, e.g., Aho & Ullman 1972). This algorithm searches for parse nodes of all possible word lengths and starting positions. It looks through all legal values of three indices, 0 ≤ i < j < k ≤ n (where n is the length of the input text), to determine whether two adjacent candidate daughters, one spanning the text from index i to index j and the other from j to k, can be combined to form a new node from i to k. This algorithm takes only a few lines to write, but since it is driven by the space of index values, it necessarily requires O(n³) time to complete its search, along with potentially n²/2 storage cells to record its intermediate results.</Paragraph>
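<Paragraph> As a point of reference, here is a minimal Python sketch of that control structure (a recognizer only; unary rules, backpointers, and other practical details are omitted, and the table formats are illustrative): </Paragraph>
    # Minimal CKY-style recognizer loop: for every span (i, k) and split
    # point j, try to combine a constituent over (i, j) with one over
    # (j, k). `lexicon` maps a word to its labels; `grammar` maps a pair
    # of daughter labels to a set of parent labels,
    # e.g. lexicon = {"fish": {"N", "V"}}, grammar = {("N", "V"): {"S"}}.
    def cky(words, lexicon, grammar):
        n = len(words)
        # chart[i][k] holds the set of labels spanning positions i..k
        chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            chart[i][i + 1] = set(lexicon.get(w, ()))
        for width in range(2, n + 1):          # span length
            for i in range(0, n - width + 1):  # start position
                k = i + width
                for j in range(i + 1, k):      # split point
                    for left in chart[i][j]:
                        for right in chart[j][k]:
                            chart[i][k] |= grammar.get((left, right), set())
        return chart
<Paragraph> The triple loop over (i, j, k) is what fixes the O(n³) time bound regardless of how sparse the grammar is, and the triangular chart accounts for the n²/2 cells. </Paragraph>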
<Paragraph position="2"> Other familiar algorithms reduce the search space by, in effect, only looking at those points in the space where there is guaranteed to be something to see. They pay for this in a more elaborate control structure. Using Earley's algorithm (1970) as the model, we can summarize their procedures as typically first predicting, top-down, what constituents could legally occur given the rules of the grammar. Then, as they sequentially scan the terminals of the text, either from the left end or the right, they incrementally confirm some of these predictions by completing hypothesized rules bottom up as all of a rule's daughter constituents are found. With common grammars (bounded state), Earley's algorithm runs in order of n time, but at a cost in storage potentially as great as G², where G is the number of rules in the grammar.</Paragraph> <Paragraph position="3"> The bulk of this storage cost in Earley's algorithm is due to its representation of the predictions, i.e. a listing of all of the potentially completable rules that is modified as each terminal is scanned and edges are completed. An active-edge chart algorithm (Kay 1980, Kaplan 1973; a good textbook treatment can be found in Winograd 1983) has a comparable storage cost because it also maintains an online representation of the production rules that are relevant to the analysis, though the particulars of how it represents these partially instantiated rules are quite different from Earley's given the differences in their control structures.</Paragraph> <Paragraph position="4"> In a parser designed for unrestricted, multi-paragraph text, like Sparser, there are problems with using any explicit run-time representation of potentially completable rules. The near-inevitability that the sentences will be broken up by unknown words means that one cannot assume that all of the root edges in the final forest will be labeled with &quot;S&quot;. As a result, one must include in the set of starting labels for the predictions essentially all of the lefthand-side labels in the grammar's rule set. (The treatment in Martin, Church & Patil 1981 handles this &quot;all predictions at all vertexes&quot; problem very elegantly.) Given that Sparser presently contains approximately 300 non-terminal labels in its semantic grammar, any algorithm with an order of G storage cost would be prohibitively expensive.</Paragraph> </Section> <Section position="2" start_page="196" end_page="196" type="sub_section"> <SectionTitle> 3.2 Introduction to SPARSER's algorithm </SectionTitle> <Paragraph position="0"> To predict everything, however, is to constrain nothing, and so the natural alternative is of course to form phrases bottom up, using only the &quot;scan&quot; and &quot;complete&quot; aspects of the basic algorithm. SPARSER uses a bottom-up parsing algorithm for its phrase structure rules.
All of the edges in its chart are what would be called &quot;inactive&quot; edges in the above approaches: they all represent actual constituents in the text rather than hypothesized ones.</Paragraph> <Paragraph position="1"> Without the constraint provided by prediction that every edge will be used in the final analysis, a conventional bottom-up algorithm suffers from two kinds of problems.</Paragraph> <Paragraph position="2"> * Locally correct but globally misaligned edges can result in additional, unconnected edges that will not be part of the final, maximal analysis.</Paragraph> <Paragraph position="3"> * The very same combination of constituents may be parsed in several different orders, resulting in multiple, &quot;spurious&quot; edges covering the same span and with the same meaning, where only one is needed.</Paragraph> <Paragraph position="4"> Sparser addresses the problem of misalignment by forcing its initial, &quot;segment by segment&quot; parsing to conform to a linguistically-motivated &quot;grid&quot; that is formed from phrase boundary information taken from the location of closed-class words; see §4.4.</Paragraph> <Paragraph position="5"> Sparser addresses the spurious edge problem by drastically restricting its search space of adjacent edges. When a new edge is entered and checked against its adjacent neighbor edges for possible completion of a new edge, only the single &quot;topmost&quot; neighbor edge at a position is checked, rather than all of the edges that have accumulated at that position as is customary in a bottom-up algorithm. This topmost edge will be the one most recently entered into the chart, and it will be the longest thus far to start/end at that position.</Paragraph> <Paragraph position="6"> This reduction in the search space by checking against only topmost edges can dramatically lower the number of checks made and edges entered. The exact amounts vary with the grammar and the text. The greatest savings comes in conjunctions or in cases where a head labeled with a recursive category can take several complements from either its left or right (e.g. a verb phrase taking auxiliaries to its left and optional adjuncts to its right), constructions that are ubiquitous in news articles. The savings is multiplicative: if there are m complements to the left of the head and n complements to the right, and if each composition of a complement and its accumulated head+complements neighbor phrase yields a new edge with the same label as the head, then the number of edges formed if all neighbors are checked is m*n. If only the top neighbor is checked the number is m+n.</Paragraph>
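<Paragraph> A schematic rendering of this restricted completion step in Python, under the simplifying assumption that the edges at a position are stacked in order of entry; the data layout and function name are illustrative stand-ins, not SPARSER's own interfaces: </Paragraph>
    # Sketch of the restricted completion step. `ending_at` maps a chart
    # position to the list of edges ending there, oldest first, so the
    # last element is the "topmost" (most recent, hence longest) edge.
    # `rules` maps an ordered pair of labels to a binary rule's parent.
    class Edge:
        def __init__(self, label, start, end, daughters=()):
            self.label, self.start, self.end = label, start, end
            self.daughters = daughters

    def check_topmost(new_edge, ending_at, rules):
        """Check the new edge only against the single topmost neighbor
        ending at its start position, not every accumulated edge."""
        neighbors = ending_at.get(new_edge.start, [])
        if not neighbors:
            return None
        top = neighbors[-1]                   # topmost edge only
        parent = rules.get((top.label, new_edge.label))
        if parent is None:
            return None
        bigger = Edge(parent, top.start, new_edge.end, (top, new_edge))
        ending_at.setdefault(bigger.end, []).append(bigger)
        return bigger
<Paragraph> Because each completion leaves only its own result on top of the stack, the m left complements and n right complements are rolled up along a single chain, which is what collapses the m*n cross-checks to m+n. </Paragraph>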
<Paragraph position="7"> A further reduction in the search space is achieved through Sparser's use of a semantic grammar. Each preterminal edge for an open-class word will have a semantic classification (label) corresponding to the kind of thing it denotes (see §1.3). If a word is ambiguous it will introduce multiple edges, one for each interpretation. By writing the rules for phrasal composition in terms of these classifications, we ensure that no phrase will be formed by the parser unless it has a semantic interpretation. This cuts down dramatically on the amount of structural ambiguity that the parser must cope with; indeed, in nearly all cases examined so far, polysemous words have been disambiguated by the first rule that applies to them to form a larger phrase. Prepositional attachment, in particular, has not proved a problem since one is not adding a PP, waiting for a later semantic interpretation process to rule on whether the combination makes sense, but instead adding a phrase labeled, e.g., &quot;for-company&quot;, whose rules of combination are markedly more specific.</Paragraph> </Section> </Section> <Section position="5" start_page="196" end_page="199" type="metho"> <SectionTitle> 4. The Details of the Algorithm </SectionTitle> <Paragraph position="0"> We will now look at particulars of the scan routine, the chart, and the phrase structure rules, and then move on to describe the phrase structure parsing algorithm in the context of a short example. Overall, Sparser is a transducer taking as input a stream of ASCII characters and producing as output (a) a recycled chart of completed edges with their denotations in the domain model, and (b) a sequence of user-specified actions triggered at hooks embedded within the core algorithms, such as the completion of an edge with a given label.</Paragraph> <Paragraph position="1"> One use of these hooks/actions has been to collect, e.g., the maximal job-event edges in an article so that they can be read out into the data base. We will not discuss them further in this paper.</Paragraph> <Section position="1" start_page="196" end_page="197" type="sub_section"> <SectionTitle> 4.1 Operations over terminals </SectionTitle> <Paragraph position="0"> At the base of the parser's operation is a scan operation that identifies (delimits) minimal tokens within the input character stream, consuming it in the process, and adding edges to the chart. Phrase boundary brackets are also entered into the chart when function words are scanned, as described in §4.4.</Paragraph> <Paragraph position="1"> As each token is delimited within the stream, it is looked up in the word list. If it is known, the preterminal object that represents it (a &quot;word&quot;) is entered into the chart, along with any edges dictated by the grammar. If it is unknown (new to the parser), a word object is constructed for it, and its string examined to characterize its morphological and capitalization properties. Tokens are minimal sequences of the same character type, e.g. the sequence &quot;$43.3 million&quot; is seen as six tokens: &quot;$&quot;, &quot;43&quot;, &quot;.&quot;, &quot;3&quot;, <one space>, &quot;million&quot;. All larger combinations are formed through phrase structure or other sorts of rules; a sketch of this token-delimiting step is given at the end of this subsection.</Paragraph> <Paragraph position="2"> In addition to the introduction of preterminal edges, non-phrase-structure rules of various sorts may be associated with tokens and are executed as the tokens are scanned. These include * simple &quot;polyword&quot; rules that interpret a sequence of terminals as a single, inseparable entity (e.g. &quot;holding company&quot;, &quot;Wall Street Journal&quot;); * rules for forming constituents on the basis of paired punctuation such as parentheses or brackets, or for special conventions such as SGML tags; * complex rules for the formation of constituents with &quot;flat&quot;, Kleene-star-style internal structures, in particular proper names; and * arbitrary actions outside the parser's scope, e.g. to do word counts, or to feed a topic detection algorithm that does not use Sparser's later stages.</Paragraph>
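<Paragraph> Here is the promised sketch of the token-delimiting step, a minimal Python rendering assuming just four character types (letters, digits, whitespace, and single punctuation marks); SPARSER's actual character classes are not specified here: </Paragraph>
    import re

    # Minimal sketch of the scan step's token delimiting: a token is a
    # maximal run of characters of one type (letters or digits), a run
    # of whitespace, or a single other character. Illustrative only.
    TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|\s+|.", re.DOTALL)

    def delimit_tokens(stream: str):
        return TOKEN.findall(stream)

    # "$43.3 million" comes apart into six minimal tokens:
    # ['$', '43', '.', '3', ' ', 'million']
    print(delimit_tokens("$43.3 million"))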
<Paragraph position="3"> A proper discussion of the algorithms for these operations and their integration into the parsing algorithm as a whole is beyond the scope of this paper. Suffice it to say that once triggered they execute as self-contained processes, and that their results are always recorded as one or more edges that they add to the chart, spanning the appropriate amount of text.</Paragraph> </Section> <Section position="2" start_page="197" end_page="197" type="sub_section"> <SectionTitle> 4.2 The Chart </SectionTitle> <Paragraph position="0"> Sparser's chart comprises three kinds of data structures: positions, edges, and edge-vectors. Positions provide indices to record the sequence of the terminals and indicate the spans of the edges. They correspond to the &quot;vertices&quot; between edges as used in other chart algorithms, but here they are first-class objects with their own primary definition in terms of the sequence of terminals, rather than being dependent on the notion of edges. Following the usual convention, positions are located between the terminals. From the point of view of the parsing algorithm, there is an unlimited stream of positions, starting with the position with index zero that precedes the dummy terminal representing the start of the text, and continuing terminal by terminal until the tokenizer has exhausted the input character stream.</Paragraph> <Paragraph position="1"> The implementation is actually in terms of a fixed-length array filled with position objects. This array grounds the notion of successive positions. The Scan operation will make the array wrap around and write over earlier position objects as needed when the length of the text exceeds the length of the array. The utility of this fixed, recycled resource is that it allows SPARSER to handle texts of arbitrary length, so long as the array is longer than the longest span of terminals over which some adjacency-driven phrase structure rule is expected to apply. A length of 250 has proved more than adequate in the Who's News domain.</Paragraph> <Paragraph position="2"> Edges represent the completion of rules, or mark the presence of terminals that are mentioned as literals in some rule. An edge has a label, which will be the lefthand-side term of the corresponding rule, and records the constituent edges (or single terminal/edge) that it spans. An edge's daughter edges can be read out recursively as a parse tree marking the sequence (derivation) of rules that constitutes the grammar's analysis of the sequence of terminals the edge spans. Like positions, edges are implemented as a recycled resource; the customary number of edge objects is 500.</Paragraph> <Paragraph position="3"> Edge-vectors link positions to edges. Each position has two edge vectors, one recording the edges that end at that position, the other the edges that begin at that position. The edges in each vector are sorted historically: the first edge in a vector will be the first edge to have been introduced into the chart that ended/started at that position; the last edge will be the most recent. The most recently introduced edge is referred to as the &quot;top&quot; edge to end/start at the position; this edge is pointed to directly by the edge vector object because of its importance to the parsing algorithm. Given the nature of the parsing algorithm it will also be the longest edge to end/start at the position.</Paragraph>
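<Paragraph> The three structures can be pictured with the following minimal Python sketch; the array sizes follow the figures quoted above, while the class shapes and method names are illustrative assumptions rather than SPARSER's implementation: </Paragraph>
    # Sketch of the chart: a fixed, recycled array of positions; edges
    # recording a label and their daughters; and per-position edge
    # vectors whose last element is the "top" (most recent, and hence
    # longest) edge to end/start there.
    class Edge:
        def __init__(self, label, start, end, daughters=()):
            self.label, self.start, self.end = label, start, end
            self.daughters = daughters

    class Position:
        def __init__(self):
            self.index = None
            self.edges_ending_here = []    # sorted oldest -> newest
            self.edges_starting_here = []  # last element is the "top" edge

    class Chart:
        def __init__(self, n_positions=250):
            self.positions = [Position() for _ in range(n_positions)]

        def position(self, index):
            # The array wraps around, recycling old position objects,
            # so texts of arbitrary length can be handled.
            p = self.positions[index % len(self.positions)]
            if p.index != index:           # recycled: reset before reuse
                p.index = index
                p.edges_ending_here, p.edges_starting_here = [], []
            return p

        def top_edge_ending_at(self, index):
            v = self.position(index).edges_ending_here
            return v[-1] if v else None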
</Section> <Section position="3" start_page="197" end_page="197" type="sub_section"> <SectionTitle> 4.3 The phrase structure grammar </SectionTitle> <Paragraph position="0"> The phrase structure grammar consists of a set of rewrite rules. The rules define patterns of labeled adjacent immediate constituents in the usual manner. The labels are either literal words (or any other sort of token such as punctuation or in some cases even whitespace), or they are atomic category symbols.</Paragraph> <Paragraph position="1"> Shown below are some of the rules used in the analysis of the sample article, given in the usual notation as terms on the left and righthand-sides of an arrow. The righthand side terms are the labels on immediate constituents; the lefthand term will be their parent, labeling any edge formed by the completion of that rule.</Paragraph> <Paragraph position="2"> As part of not needing to support an &quot;active edge&quot; representation of partially complete rules during runtime, Sparser's basic operation, which we can call &quot;check&quot;, is defined over a pair of adjacent edges. A table is consulted to see if there is some rule (or dotted expansion of a rule, see below) that lists the labels of those two edges, in order, as its righthand side. If there is such a rule, a new edge is constructed and entered into the chart. If a context-free rule is involved, then the edge will span both daughter edges and is labeled with the lefthand side term of the rule. If it is a context-sensitive rule, then the designated daughter edge will be respanned and given that label.</Paragraph> <Paragraph position="3"> Sparser supports rules with more than two righthand side terms by converting them to a kind of Chomsky Normal Form using a dotted-rule convention as described by Martin, Church & Patil (1981).</Paragraph>
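<Paragraph> A minimal Python sketch of the check operation and of the dotted-rule binarization; the table layout and the dotted-label naming scheme are illustrative, not the paper's own notation: </Paragraph>
    # Sketch of "check" plus the conversion of longer rules into
    # Chomsky-Normal-Form-style binary rules via intermediate "dotted"
    # labels.
    def binarize(lhs, rhs):
        """Turn lhs -> r1 r2 ... rn into a chain of binary rules,
        e.g. binarize("np", ["det", "adj", "noun"]) ==
        {("det", "adj"): "np.1", ("np.1", "noun"): "np"}."""
        rules = {}
        prev = rhs[0]
        for i, term in enumerate(rhs[1:], start=1):
            parent = lhs if i == len(rhs) - 1 else f"{lhs}.{i}"
            rules[(prev, term)] = parent
            prev = parent
        return rules

    def check(left_label, right_label, table):
        """Consult the rule table for a rule whose righthand side is the
        ordered pair of two adjacent edges' labels; return the parent
        label if one exists, else None."""
        return table.get((left_label, right_label))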
</Section> <Section position="4" start_page="197" end_page="199" type="sub_section"> <SectionTitle> 4.4 The parsing algorithm </SectionTitle> <Paragraph position="0"> The phrase structure algorithm divides logically into three processes: (1) delimiting the next segment, (2) parsing the new edges within that segment, and (3) parsing edges across segment boundaries. The control structure treats these as independent processes that signal events, and switches between them as the events dictate. We describe each of these processes in turn, using as our example the portion of the example article excerpted below.</Paragraph> <Paragraph position="1"> ... president and chief executive officer of the Celeron Corp. unit, a holding company for Goodyear's All American Pipeline.</Paragraph> <Paragraph position="2"> The notion of a &quot;segment&quot; in SPARSER is a sequence of terminals between a matching set of phrase boundary brackets that are introduced into the chart by closed-class words or by known open-class words from the domain vocabulary.</Paragraph> <Paragraph position="3"> Just below is the excerpt with its brackets. The initial bracket was introduced by the just-preceding known verb, &quot;become&quot;; the other brackets were introduced by the function words/punctuation/affixes: &quot;and&quot;, &quot;of&quot;, &quot;the&quot;, &quot;,&quot;, &quot;a&quot;, &quot;for&quot;, &quot;'s&quot;, and &quot;.&quot;.</Paragraph> <Paragraph position="4"> [ president ] and [ chief executive officer ] of [ the Celeron Corp. unit ] , [ a holding company ] for [ Goodyear ] 's [ All American Pipeline ] . ]</Paragraph> <Paragraph position="5"> The idea of segmenting a text on the basis of its closed-class words is an old one. A recent, comparably systematic system where closed-class words are used is described by O'Shaughnessy (1989). And it appears that something like this scheme is used in Hindle's FIDDITCH parser (partially described in Hindle 1983).</Paragraph> <Paragraph position="6"> The segment delimiter starts at the last position where a segment terminated (or initially at position 0). It makes successive calls to Scan, adding words and their immediate pre-terminal edges to the chart and running any of the non-phrase-structure parsing processes that the words trigger. This process stops when a word is scanned that introduces a close bracket (&quot;]&quot;). At this point control is passed to the second process, to form whatever constituents may be found within the new segment by looking for combinations of the pre-terminal edges.</Paragraph> <Paragraph position="7"> When using a normal &quot;all edges&quot; bottom-up algorithm, the criterion for which of the many trees to select is usually to choose the combination that provides the longest consistent account of the text and strands the fewest unattached edges. We mimic that selection criterion online, by having the parser first respect the linguistically motivated boundaries provided by closed-class and other known words, parsing within a segment before combining any edges across a segment boundary. And second by respecting the possibility that the rightmost edge in a segment may be extended by some not-yet-formed edge to its right in the adjacent segments: the algorithm does not allow a rightmost edge to be combined with an edge to its left if the resulting edge would not have the same label and consequently would not have the same possibilities for rightward extension.</Paragraph> <Paragraph position="8"> In terms of the interaction of the three processes, this means first that within-segment parsing is constrained not to permit any combinations of the segment's rightmost edge and its immediate neighbor edge to its left if that would change the possibilities for extending that rightmost edge later through a combination with some edge to its right. (This is a trivial check against the grammar tables.) Once the within-segment parsing has finished, the resulting rightmost edge is similarly examined: if it permits rightward combinations then we return to the segment-delimiting process, and from that to the within-segment parsing process. Once there is finally a segment whose rightmost edge does not have a possible rightward extension, then the across-segment parsing process is allowed to start operating, beginning with the then rightmost edge in the chart overall.4 As this third process moves leftwards forming successively larger edges, the possibility of rightward extensions is continually checked for, and the segment-delimiting process re-entered as needed.</Paragraph> <Paragraph position="9"> 4 In some cases this can mean that an entire sentence is scanned before any across-segment edges are formed. This assumes, of course, that one does not write a grammar rule with period as its left term; if one did, the scan would continue.</Paragraph>
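<Paragraph> Before walking through an example, here is a compact Python sketch of how the three processes might interleave; the class and its stub methods are illustrative stand-ins for the procedures just described, not SPARSER's actual interfaces: </Paragraph>
    class ThreeProcessParser:
        """Sketch of the control structure only; the process methods
        are stubs standing in for the procedures described above."""

        def delimit_next_segment(self, start):          # process 1
            ...  # scan terminals until a word introduces a close bracket

        def parse_within_segment(self, start, end):     # process 2
            ...  # combine edges inside the segment; return rightmost edge

        def parse_across_segments(self, end):           # process 3
            ...  # roll up accumulated edges leftward from `end`

        def extends_rightward(self, label):
            ...  # trivial check against the grammar tables

        def parse(self, text_length):
            start = 0
            while start < text_length:
                end = self.delimit_next_segment(start)
                rightmost = self.parse_within_segment(start, end)
                if rightmost is not None and self.extends_rightward(rightmost.label):
                    start = end   # rightmost edge may still grow: keep scanning
                    continue
                self.parse_across_segments(end)
                start = end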
<Paragraph position="10"> We can see this control structure loop in action by walking through the excerpted text. Let us assume that we have reached the point where the segment containing &quot;the Celeron Corp. unit&quot; has just been delimited. The chart will be as shown below. Positions are indicated by their index numbers between and below each of the words. Edges are indicated by half rectangles connecting the positions. The numbers on the edges (e.g. &quot;e1&quot;) are for expository purposes only; they reflect the order in which each edge was introduced. The edge labels are not shown. For clarity the edges in the just-delimited segment are shown above the text, and those of earlier and later segments below.</Paragraph> <Paragraph position="11"> The within-segment parsing process will look for combinations of the edges between positions 100 and 105, working rightwards from edge9. Edge9, the preterminal edge over the word &quot;unit&quot;, is labeled &quot;head-of-subsidiary-phrase&quot; in this grammar. There are no rules in the grammar that would extend that edge into a larger edge to its right, and so the process is allowed to look for leftward combinations. There are two edges adjacent to the left of edge9.</Paragraph> <Paragraph position="12"> Following the restricted search space convention of the algorithm, only the more recent of these, edge7, is checked. According to rule number two of the set listed earlier, edge7 and edge9 combine to form a new edge, which is then combined with edge5 according to rule three. There is now one edge spanning all of the segment; if there were a gap, say due to the presence of unknown words in the segment, then heuristic rules would be attempted, as briefly mentioned at the end of §1.1.</Paragraph> <Paragraph position="13"> The new topmost edge over the segment, labeled &quot;subsidiary-company&quot;, does participate in rules that could combine it with an edge to its right, and so the delimiting process is resumed to scan until the next segment is terminated. That segment will contain the words &quot;a holding company&quot;. Within-segment parsing will span the segment with a phrase labeled &quot;company-description&quot;, which in the present grammar takes rightwards extensions, and so the delimiting process is run again. This iterates until the period after &quot;Pipeline&quot; is reached, at which point across-segment parsing is finally begun. It rolls up the accumulated edges one after the other from the right. The penultimate composition in this example is the title phrase (edge4) and a &quot;subsidiary-company&quot; phrase spanning all the way from position 99 to position 115 just before the period.</Paragraph> </Section> </Section> <Section position="6" start_page="199" end_page="199" type="metho"> <SectionTitle> 5. Conclusions: Why is this efficient? </SectionTitle> <Paragraph position="0"> Given two parsers that employ the same algorithms, the more efficient one will be the one with the most carefully designed and optimized implementation.
The two parsers will carry out the same steps (at the algorithmic level), but one will do them more quickly, consuming less storage, etc.</Paragraph> <Paragraph position="1"> From this mechanical point of view Sparser comes off well as compared with other parsing systems that the author is familiar with: a Lisp program, it uses only preallocated storage, which led to a three-fold increase in speed relative to its prior implementation.</Paragraph> <Paragraph position="2"> Holding the quality of the implementation constant (and of course the choice of machine on which any tests are made), the greatest increase in efficiency comes from improving the algorithm so that fewer steps are taken. We achieved this in two ways.</Paragraph> <Paragraph position="3"> First, we employed a particular technique for reducing the search space through which the parser searched, thereby reducing the number of checks made against the grammar to see whether two edges could be combined, and also reducing the number of edges ever entered into the chart. While we have not yet made a systematic comparison, this technique of checking only the topmost edges at a position appears to result in three to ten times fewer edges ever being formed (depending on the article and the grammar) when compared to an earlier variant of Sparser's algorithm that checked all of the edges.</Paragraph> <Paragraph position="4"> Second, we employed a grammar with semantically labeled terms, thereby ensuring that only edges that could receive a valid semantic interpretation would ever be formed. This does not cut down on the number of edges checked against the grammar, but it has a dramatic effect on the number of edges ever allowed to be formed in the first place. While again we do not have systematic counts (which would effectively require having an entirely new grammar that used only syntactic labels), our impression is that the reduction in ambiguity that the semantic labels brought about had a more significant effect on the number of edges and checks than any variation in the algorithm.</Paragraph> </Section></Paper>