<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1061">
  <Title>A Bag of Useful Techniques for Efficient and Robust Parsing</Title>
  <Section position="5" start_page="473" end_page="474" type="metho">
    <SectionTitle>
3 Improvements in unification
</SectionTitle>
    <Paragraph position="0"> Unification is the single most expensive operation performed in the course of parsing. Up to 90% of the CPU time expended in parsing a sentence using a large-scale unification based grammar can go into feature structure and type unification. Therefore, any improvements in the efficiency of unification would have direct consequences for the overall performance of the system. null One key to reducing the cost of unification is to find the simplest set of operations that meet the needs of grammar writers but still can be efficiently implemented. The unifier which was part of the original HPSG grammar development system mentioned in the introduction (described by (Backofen and Krieger, 1993)) provided a number of advanced features, including distributed (or named) disjunctions (D6rre and Eisele, 1990) and support for full backtracking.</Paragraph>
    <Paragraph position="1"> While these operations were sometimes useful, they also made the unifier much more complex than was really necessary.</Paragraph>
    <Paragraph position="2"> The unification algorithm used by the current system is a modification of Tomabechi's (Tomabechi, 1991) &amp;quot;quasi-destructive&amp;quot; unification algorithm. Tomabechi's algorithm is based on the insight that unification often fails, and copying should only be performed when the unification is going to succeed. This makes it particularly well suited to chart-based parsing.</Paragraph>
    <Paragraph position="3"> During parsing, each edge must be built without modifying the edges that contribute to it. With a non-backtracking unifier, one option is to copy the daughter feature structures before performing a destructive unification operation, while the other is to use a non-destructive algorithm that produces a copy of the result up to the point a failure occurs. Either approach will result in some structures being built in the course of an unsuccessful unification, wasting space and reducing the overall throughput of the system. Tomabechi avoids these problems by simulating non-destructiveness without incurring the overhead necessary to support backtracking. First, it performs a destructive (but reversible) check that the two structures are compatible, and only when that succeeds does it produce an output structure. Thus, no output structures are built until it is certain that the unification will ultimately succeed.</Paragraph>
    <Paragraph position="4"> While an improvement over simple destructive unification, Tomabechi's approach still suffers from what Kogure (Kogure, 1990) calls redundant copying. The new feature structures produced in the second phase of unification include copies of all the substructures of the input graphs, even when these structures are unchanged. This can be avoided by reusing parts of the input structures in the output structure (Carroll and Malouf, 1999) without introducing significant bookkeeping overhead.</Paragraph>
    <Paragraph position="5"> To keep things as simple and efficient as possible, the improved unifier also only supports conjunctive feature structures. While disjunctions can be a convenient descriptive tool for writing grammars, they are not absolutely necessary. When using a typed grammar formalism, most disjunctions can be easily put into the type hierarchy. Any disjunctions which cannot be removed by introducing new supertypes can be eliminated by translating the grammar into  disjunctive normal form (DNF). Of course, the ratio of the number of rules and lexical entries in the original grammar and the DNFed grammar depends on the 'style' of the grammar writer, the particular grammatical theory used, the number of disjunction alternatives, and so on.</Paragraph>
    <Paragraph position="6"> However, context management for distributed disjunctions requires enormous overhead when compared to simple conjunctive unification, so the benefits of using a simplified unifier outweigh the cost of moving to DNF. For the German and Japanese VERBMOBIL grammars, we got 1.4-3x more rules and lexical entries, but by moving to a sophisticated conjunctive unifier we obtained an overall speed-up of 2-5.</Paragraph>
  </Section>
  <Section position="6" start_page="474" end_page="474" type="metho">
    <SectionTitle>
4 Precompiling Type Unification
</SectionTitle>
    <Paragraph position="0"> After changing the unification engine, type unification now became a big factor in processing: nearly 50% of the overall unification and copying time was taken up by the computation of the greatest lower bounds (GLBs). Although we have in the past computed GLBs online efficiently with bit vectors, off-line computation is of course superior.</Paragraph>
    <Paragraph position="1"> The feasibility of the latter method depends on the number of types T of a grammar. The English grammar employs 6000 types which results in 36,000,000 possible GLBs. Our experiments have shown, however, that only 0.5%2% of the type unifications were successful and only these GLBs need to be entered into the GLB table. In our implementation, accessing an arbitrary GLB takes less than 0.002 msec, compared to 15 msec of 'expensive' bit vector computation (following (A'/t-Kaci et al., 1989)) which also produces a lot of memory garbage.</Paragraph>
    <Paragraph position="2"> Our method, however, does not consume any memory and works as follows. We first assign a unique code (an integer) to every type t E 7-.</Paragraph>
    <Paragraph position="3"> After that, the GLB of s and t is assigned the following code (again an integer, in fact a fixnum): code(s) x ITI + code(t). This arraylike encoding guarantees that a specific code is given away to a GLB at most once. Finally, this code together with the GLB is stored in a hash table. Hence, type unification costs are minimized: two symbol table lookups, one addition, one multiplication, and a hash table lookup.</Paragraph>
    <Paragraph position="4"> In order to access a unique maximal lower bound (= GLB), we must require that the type hierarchy is a lower semilattice (or bounded complete partial order). This is often not the case, but this deficiency can be overcome either by pre-computing the missing types (an efficient implementation of this takes approximately 25 seconds for the English grammar) or by making the online table lookup more complex.</Paragraph>
    <Paragraph position="5"> A naive implementation of the off-line computation (compute the GLBs for T x T) only works for small grammars. Since type unification is a commutative operation (glb(s,t) = glb(t, s); s,t E 7&amp;quot;), we can improve the algorithm by computing only glb(s,t). A second improvement is due to the following fact: if the GLB of s and t is bottom, we do not have to compute the GLBs of the subtypes of both s and t, since they guarantee to fail. Even with these improvements, the GLB computation of a specific grammar took more than 50 CPU hours, due to the special 'topology' of the type hierarchy. However, not even the failing GLBs need to be computed (which take much of the time).</Paragraph>
    <Paragraph position="6"> When starting with the leaves of the type hierarchy, we can compute maximal components w.r.t, the supertype relation: by following the subsumption links upwards, we obtain sets of types, s.t. for a given component C, we can guarantee that glb(s,t) ~ _k, for all s,t E C.</Paragraph>
    <Paragraph position="7"> This last technique has helped us to drop the off-line computation time to less than one CPU hour.</Paragraph>
    <Paragraph position="8"> Overall when using the off-line GLBs, we obtained a parsing speed-up of 1.5, compared to the bit vector computation. 2</Paragraph>
  </Section>
  <Section position="7" start_page="474" end_page="475" type="metho">
    <SectionTitle>
5 Precompiling Rule Filters
</SectionTitle>
    <Paragraph position="0"> The aim of the methods described in this and the next section is to avoid failing unifications by applying cheap 'filters' (i.e., methods that are cheaper than unification). The first filter we want to describe is a rule application filter.</Paragraph>
    <Paragraph position="1"> We have used this method for quite a while, and it has proven both efficient and easy to employ.</Paragraph>
    <Paragraph position="2"> Our rule application filter is a function that 2An alternative approach to improving the speed of type unification would be to implement the GLB table as a cache, rather than pre-computing the table's contents exhaustively. Whether this works well in practice or not depends on the efficiency of the primitive glb(s, t) computation; if the latter were relatively slow then the parser itself would run slowly until the cache was sufficiently full that cache hits became predominant.</Paragraph>
    <Paragraph position="3">  takes two rules and an argument position and returns a boolean value that specifies if the second rule can be unified into the given argument position of the first rule.</Paragraph>
    <Paragraph position="4"> Take for example the binary filler-head rule in the HPSG grammar for German. Since this grammar allows not more than one element on the SLASH list, the left hand side of the rule specifies an empty list as SLASH value. In the second (head) argument of the rule, SLASH has to be a list of length one.</Paragraph>
    <Paragraph position="5"> Consequently, a passive chart item whose top-most rule is a filler-head rule, and so has an empty SLASH, can not be a valid second argument for another filler-head rule application. The filter function, when called with arguments (filler-head-rule-nr, filler-head-rule-nr, 2 ) for mother rule, topmost rule of the daughter and argument position respectively, will return false and no unification attempt will be made.</Paragraph>
    <Paragraph position="6"> The conjunctive grammars have between 20 and 120 unary and binary rule schemata. Since all rule schemata in our system bear a unique number, this filter can be realized as a three dimensional boolean array. Thus, access costs are minimized and no additional memory is used at run-time. The filters for the three languages are computed off-line in less than one minute and rule out 50% to 60% of the failing unifications during parsing, saving about 45% of the parsing time.</Paragraph>
  </Section>
  <Section position="8" start_page="475" end_page="476" type="metho">
    <SectionTitle>
6 Dynamic Unification Filtering
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="475" end_page="476" type="sub_section">
      <SectionTitle>
('Quick Check')
</SectionTitle>
      <Paragraph position="0"> Our second filter (which we have dubbed the 'quick check') exploits the fact that unification fails more often at certain points in feature structures than at others. For example, syntactic features such as CAW(egory) are very frequent points of failure, whereas unification almost never fails on semantic features which are used merely to accumulate pieces of the logical form. Since all substructures are typed, unification failure is manifested by a type clash when attempting a type unification. The quick check is invoked before each unification attempt to check the most frequent failure points, each stored as a feature path.</Paragraph>
      <Paragraph position="1"> The technique works as follows. First, there is an off-line stage, in which a modified unification engine is used that does not return immediately after a single type unification failure, but instead records in a global data structure the paths at which all such failures occurred.</Paragraph>
      <Paragraph position="2"> Using this modified system a set of sentences is parsed, and the n paths with the highest failure counts are saved. It is exactly these paths that are used later in filtering.</Paragraph>
      <Paragraph position="3"> During parsing, when an active chart item (i.e., a rule schema or a partly instantiated rule schema) and a passive chart item (a lexical entry or previously-built constituent) are combined, the parser has to unify the feature structure of the passive item into the substructure of the active item that corresponds to the argument to be filled. If either of the two structures has not been seen before, the parser associates with it a vector of length n containing the types at the end of the previously determined paths. The first position of the vector contains the type corresponding to the most frequently failing path, the second position the second most frequently failing path, and so on. Otherwise, the existing vectors of types are retrieved. Corresponding elements in the vectors are then type-unified, and full unification of the feature structures is performed only if all the type unifications succeed. null Clearly, when considering the number of paths n used for this technique, there is a trade-off between the time savings from filtered unifications and the effort required to create the vectors and compare them. The main factors involved are the speed of type unification and the percentage of unification attempts filtered out (the 'filter rate') with a given set of paths. The optimum number of paths cannot be determined analytically. Our English, German and Japanese grammars use between 13 to 22 paths for quick check filtering, the precise number having been established by experimentation. The paths derived for these grammars are somewhat surprising, and in many cases do not fit in with the intuitions of the grammar-writers.</Paragraph>
      <Paragraph position="4"> In particular, some of the paths are very long (of length ten or more). Optimal sets of paths for grammars of this complexity could not be produced manually.</Paragraph>
      <Paragraph position="5"> The technique will only be of benefit if type unification is computationally cheap--as indeed it is in our implementation (section 4)--and if the filter rate is high (otherwise the extra work  performed essentially just duplicates work carried out later in unification). There is also overlap between the quick check and the rule filter (previous section) since they are applied at the same point in processing. We have found that (given a reasonable number of paths) the quick check is the more powerful filter of the two because it functions dynamically, taking into account feature instantiations that occur during the parsing process, but that the rule filter is still valuable if executed first since it is a single, very fast table lookup. Applying both filters, the filter rate ranges from 95% to over 98%.</Paragraph>
      <Paragraph position="6"> Thus almost all failing unifications are avoided.</Paragraph>
      <Paragraph position="7"> Compared to the system with only rule application filtering, parse time is reduced by approximately 75% 3 .</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="476" end_page="476" type="metho">
    <SectionTitle>
7 Reducing Feature Structure Size via Restrictors
</SectionTitle>
    <Paragraph position="0"> via Restrictors The 'category' information that is attached to each chart item of the parser consists of a single feature structure. Thus a rule is implemented by a feature structure where the daughters have to be unified into predetermined substructures. Although this implementation is along the lines of HPSG, it has the drawback that the tree structure that is already present in the chart items is duplicated in the feature structures.</Paragraph>
    <Paragraph position="1"> Since HPSG requires all relevant information to be contained in the SYNSEM feature of the mother structure, the unnecessary daughters only increase the size of the overall feature structure without constraining the search space.</Paragraph>
    <Paragraph position="2"> Due to the Locality Principle of HPSG (Pollard and Sag, 1987, p. 145ff), they can therefore be legally removed in fully instantiated items. The situation is different for active chart items since daughters can affect their siblings.</Paragraph>
    <Paragraph position="3"> To be independent from a-certain grammatical theory or implementation, we use restrictors similar to (Shieber, 1985) as a flexible and easy-to-use specification to perform this deletion. A positive restrictor is an automaton describing the paths in a feature structure that will remain after restriction (the deletion operation), 3There are refinements of the technique which we have implemented and which in practice produce additional benefits; we will report these in a subsequent paper. Briefly, they involve an improvement to th e path collection method, and the storage of other information besides types in the vectors.</Paragraph>
    <Paragraph position="4"> whereas a negative restrictor specifies the parts to be deleted. Both kinds of restrictors can be used in our system.</Paragraph>
    <Paragraph position="5"> In addition to the removal of the tree structure, the grammar writer can specify the restrictor further to remove features that are only used locally and do not play a role in further derivation. It is worth noting that this method is only correct if the specified restrictor does not remove paths that would lead to future unification failures. The reduction in size results in a speed-up in unification itself, but also in copying and memory management.</Paragraph>
    <Paragraph position="6"> As already mentioned in section 2, there exists a second restrictor to get rid of unnecessary parts of the lexical entries after lexicon processing. The speed gain using the restrictors in parsing ranges from 30% for the German system to 45% for English.</Paragraph>
  </Section>
  <Section position="10" start_page="476" end_page="477" type="metho">
    <SectionTitle>
8 Limiting the Number of Initial
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="476" end_page="477" type="sub_section">
      <SectionTitle>
Chart Items
</SectionTitle>
      <Paragraph position="0"> Since the number of lexical entries per stem has a direct impact on the number of parsing hypotheses (in the worst case leads to an exponential increase), it would be a good idea to have a cheap mechanism at hand that helps to limit these initial items. The technique we have implemented is based on the following observation: in order to contribute to a reading, certain items (concrete lexicon entries, but also classes of entries) require the existence of other items such that the non-existence of one allows a safe deletion of the other (and vice versa). In German, for instance, prefix verbs require the right separable prefixes to be present in the chart, but also a potential prefix requires its prefix verb.</Paragraph>
      <Paragraph position="1"> Note that such a technique operates in a much larger context (in fact, the whole chart) than a local rule application filter or the quick-check method. The method works as follows. In a preprocessing step, we first separate the chart items which encode prefix verbs from those items which represent separable prefixes. Since both specify the morphological form of the prefix, a set-exclusive-or operation yields exactly the items which can be safely deleted from the chart.</Paragraph>
      <Paragraph position="2"> Let us give some examples to see the usefulness of this method. In the sentence Ich komme mo,'ge,~ (I (will) come tomorrow), komme maps  onto 97 lexical entries--remember, komme might encode prefix verbs such as ankommen (arrive), zuriickkommen (come back), etc. although here, none of the prefix verb readings are valid, since a prefix is missing. Using the above method, only 8 of 97 lexical entries will remain in the chart. The sentence Ich komme morgen an (I (will) arrive tomorrow) results in 8+7 entries for komme (8 entries for the come reading together with 7 entries for the arrive reading of komme) and 3 prepositional readings plus 1 prefix entry for an. However in Der Mann wartet an der Tiir (The man is waiting at the door), only the three prepositional readings for an come into play, since no prefix verb anwartet exists. Although there are no English prefix verbs, the method also works for verbs requiring certain particles, such as come, come along, come back, come up, etc.</Paragraph>
      <Paragraph position="3"> The parsing time for the second example goes down by a factor of 2.4; overall savings w.r.t, our reference corpus is 17% of the parsing time (i.e., speed-up factor of 1.2).</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="477" end_page="478" type="metho">
    <SectionTitle>
9 Computing Best Partial Analyses
</SectionTitle>
    <Paragraph position="0"> Given deficient, ungrammatical, or spontaneous input, a traditional parser is not able to deliver a useful result. To overcome this disadvantage, our approach focuses on partial analyses which are combined in a later stage to form total analyses without giving up the correctness of the overall deep grammar. But what can be considered good partial analyses? Obviously a (sub)tree licensed by the grammar which covers a continuous part of the input (i.e., a passive parser edge). But not every passive edge is a good candidate since otherwise we would end up with perhaps thousands of them. Instead, our approach computes an 'optimal' connected sequence of partial analyses which cover the whole input. The idea here is to view the set of passive edges as a directed graph and to compute shortest paths w.r.t, a user-defined estimation function.</Paragraph>
    <Paragraph position="1"> Since this graph is acyclic and topologically sorted, we have chosen the DAG-shortest-path algorithm (Cormen et al., 1990) which runs in O(V + E). We have modified this algorithm to cope with the needs we have encountered in speech parsing: (i) one can use several start and ~nd vertices (e.g., in case of n-best chains or word graphs); (ii) all best shortest paths are returned (i.e., we obtain a shortest-path subgraph); (iii) estimation and selection of the best edges is done incrementally when parsing n-best chains (i.e., only new passive edges entered into the chart are estimated and perhaps selected). This approach has one important property: even if certain parts of the input have not undergone at least one rule application, there are still lexical edges which help to form a best path through the passive edges. This means that we can interrupt parsing at any time, but still obtain a useful result.</Paragraph>
    <Paragraph position="2"> Let us give an example to see how the estimation function on edges (-- trees) might look like (this estimation is actually used in the German grammar): * n-ary tree (n &gt; 1) with utterance status (e.g., NPs, PPs): value 1 * lexical items: value 2 * otherwise: value c~ This approach does not always favor paths with longest edges as the example in figure 2 shows--instead it prefers paths containing no lexical edges (where this is possible) and there might be several such paths having the same cost. Longest (sub)paths, however, can be obtained by employing an exponential estimation function. Other properties, such as prosodic information or probabilistic scores could also be utilized in the estimation function. A detailed description of the approach can be found in (Kasper et al., 1999).</Paragraph>
    <Paragraph position="3">  Note that the paths PR and QR are chosen, but not ST, although S is the longest edge.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML