<?xml version="1.0" standalone="yes"?> <Paper uid="P85-1017"> <Title>A Structure-Sharing Representation for Unification-Based Grammar Formalisms</Title> <Section position="5" start_page="137" end_page="137" type="metho"> <SectionTitle> 3 The Problem </SectionTitle> <Paragraph position="0"> In a chart parser [3], all the intermediate stages of derivations are encoded in edges, representing either incomplete (active) or complete (passive) phrases. For PATR-II, each edge contains a dag instance that represents the phrase type of that edge. The problem we address here is how to encode multiple dag instances efficiently.</Paragraph> <Paragraph position="1"> In a chart parser for context-free grammars, the solution is trivial: instances can be represented by the unique internal names (that is, addresses) of their objects because the information contained in an instance is exactly the same as that in the original object.</Paragraph> <Paragraph position="2"> In a parser for PATR-II or any other unification-based formalism, however, distinct instances of an object will in general specify different values for attributes left unspecified in the original object. Clearly, the attribute values specified for one instance are independent of those for another instance of the same object.</Paragraph> <Paragraph position="3"> One obvious solution is to build new instances by copying the original object and then updating the copy with the new attribute values. This was the solution adopted in the first PATR-II parser [8]. 
The high cost of this solution, both in time spent copying and in space required for the copies themselves, constitutes the principal justification for employing the method described here.</Paragraph> </Section> <Section position="6" start_page="137" end_page="138" type="metho"> <SectionTitle> 4 Structure Sharing </SectionTitle> <Paragraph position="0"> Structure sharing is based on the observation that an initial object, together with a list of update records, contains the same information as the object that results from applying the updates to the initial object. In this way, we can trade the cost of actually applying the updates (with possible copying to avoid the destruction of the source object) against the cost of having to compute the effects of updates when examining the derived object. This reasoning applies in particular to dag instances that are the result of adding attribute values to other instances.</Paragraph> <Paragraph position="1"> As in the variant of Boyer and Moore's method [1] used in Prolog [9], I shall represent a dag instance by a molecule (see Figure 2) consisting of 1. [A pointer to] the initial dag, the instance's skeleton; 2. [A pointer to] a table of updates of the skeleton, the instance's environment.</Paragraph> <Paragraph position="2"> Environments may contain two kinds of updates: reroutings that replace a dag node with another dag; arc bindings that add to a node a new outgoing arc pointing to a dag. Figure 3 shows the unification of the dags I_1 and I_2.</Paragraph> <Paragraph position="4"> After unification, the top node of I_2 is rerouted to I_1 and the top node of I_1 gets an arc binding with label c and a value that is the subdag [d : e] of I_2. As we shall see later, any update of a dag represented by a molecule is either an update of the molecule's skeleton or an update of a dag (to which the same reasoning applies) appearing in the molecule's environment. 
Therefore, the updates in a molecule's environment are always shown in figures tagged by a boxed number identifying the affected node in the molecule's skeleton.</Paragraph> <Paragraph position="5"> The choice of which dag is rerouted and which one gets arc bindings is arbitrary.</Paragraph> <Paragraph position="6"> For reasons discussed later, the cost of looking up instance node updates in Boyer and Moore's environment representation is O(|d|), where |d| is the length of the derivation (a sequence of resolutions) of the instance. In the present representation, however, this cost is only O(log |d|). This better performance is achieved by particularizing the environment representation and by splitting the representational scheme into two components: a memory organization and a dag representation. A dag representation is a way of mapping the mathematical entity dag onto a memory. A memory organization is a way of putting together a memory that has certain properties with respect to lookup, updating and copying. One can think of the memory organization as the hardware and the dag representation as the data structure.</Paragraph> </Section> <Section position="7" start_page="138" end_page="139" type="metho"> <SectionTitle> 5 Memory organization </SectionTitle> <Paragraph position="0"> In practice, random-access memory can be accessed and updated in constant time. However, updates destroy old values, which is obviously unacceptable when dealing with alternative updates of the same data structure. If we want to keep the old version, we need to copy it first into a separate part of memory and change the copy instead. 
For the normal kind of memory, copying time is proportional to the size of the object copied.</Paragraph> <Paragraph position="1"> The present scheme uses another type of memory organization, virtual-copy arrays, which requires O(log n) time to access or update an array with highest used index</Paragraph> <Paragraph position="3"> of n, but in which the old contents are not destroyed by updating. Virtual-copy arrays were developed by David H. D.</Paragraph> <Paragraph position="4"> Warren [10] as an implementation of extensible arrays for Prolog.</Paragraph> <Paragraph position="5"> Virtual-copy arrays provide a fully general memory structure: anything that can be stored in random-access memory can be stored in virtual-copy arrays, although pointers in machine memory correspond to indexes in a virtual-copy array. An updating operation takes a virtual-copy array, an index, and a new value and returns a new virtual-copy array with the new value stored at the given index. An access operation takes an array and an index, and returns the value at that index.</Paragraph> <Paragraph position="6"> Basically, virtual-copy arrays are 2^k-ary trees for some fixed k > 0. Define the depth d(n) of a tree node n to be 0 for the root and d(p) + 1 if p is the parent of n. Each virtual-copy array a also has a positive depth D(a) ≥ max{d(n) : n is a node of a}. A tree node at depth D(a) (necessarily a leaf) can be either an array element or the special marker ⊥ for unassigned elements. All leaf nodes at depths lower than D(a) are also ⊥, indicating that no elements have yet been stored in the subarray below the node. With this arrangement, the array can store at most 2^{kD(a)} elements, numbered 0 through 2^{kD(a)} - 1, but unused subarrays need not be allocated.</Paragraph> <Paragraph position="7"> By numbering the 2^k daughters of a nonleaf node from 0 to 2^k - 1, a path from a's root to an array element (a leaf at depth D(a)) can be represented by a sequence n_0 ... n_{D(a)-1} in which n_d is the number of the branch taken at depth d.</Paragraph>
<Paragraph position="8"> This sequence is just the base 2^k representation of the index n of the array element, with n_0 the most significant digit and n_{D(a)-1} the least significant (Figure 4).</Paragraph> <Paragraph position="9"> When a virtual-copy array a is updated, one of two things may happen. If the index for the updated element exceeds the maximum for the current depth (as in the update of a[8] in Figure 5), a new root node is created for the updated array and the old array becomes the leftmost daughter of the new root. Other nodes are also created, as appropriate, to reach the position of the new element. If, on the other hand, the index for the update is within the range for the current depth, the path from the root to the element being updated is copied and the old element is replaced in the new tree by the new element (as in the a[2] := h update in Figure 5).</Paragraph> <Paragraph position="10"> This description assumes that the element being updated has already been set. If not, the branch to the element may terminate prematurely in a ⊥ leaf, in which case new nodes are created to the required depth and attached to the appropriate position at the end of the new path from the root.</Paragraph> </Section> <Section position="8" start_page="139" end_page="141" type="metho"> <SectionTitle> 6 Dag representation </SectionTitle> <Paragraph position="0"> Any dag representation can be implemented with virtual-copy memory instead of random-access memory. If that were done for the original PATR-II copying implementation, a certain measure of structure sharing would be achieved.</Paragraph> <Paragraph position="1"> The present scheme, however, goes well beyond that by using the method of structure sharing introduced in Section 4. 
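The virtual-copy arrays of Section 5 can be sketched in Python as a persistent tree with path copying. This is only an illustration under stated assumptions, not Warren's implementation: the names VCArray, K, FANOUT and UNSET are hypothetical, tree nodes are modelled as tuples, and None stands in for the unassigned marker (so None itself cannot be stored as an element).

```python
from typing import Any, List

K = 2                  # assumption: each node has 2**K daughters
FANOUT = 2 ** K
UNSET = None           # stand-in for the unassigned marker


def _digits(index: int, depth: int) -> List[int]:
    """Base-2**K digits of index, most significant first, padded to depth."""
    return [(index // FANOUT ** d) % FANOUT for d in range(depth - 1, -1, -1)]


def _set(node: Any, digits: List[int], value: Any) -> Any:
    """Copy the path selected by digits, sharing every other subtree."""
    if not digits:
        return value
    daughters = list(node) if node is not UNSET else [UNSET] * FANOUT
    daughters[digits[0]] = _set(daughters[digits[0]], digits[1:], value)
    return tuple(daughters)


class VCArray:
    """Array whose update() returns a new version and keeps the old one valid."""

    def __init__(self, root: Any = UNSET, depth: int = 1) -> None:
        self.root = root      # nested tuples of length FANOUT, or UNSET
        self.depth = depth    # D(a): array elements sit at this depth

    def get(self, index: int) -> Any:
        if index >= FANOUT ** self.depth:
            return UNSET                  # beyond the current capacity
        node = self.root
        for digit in _digits(index, self.depth):
            if node is UNSET:
                return UNSET              # unallocated subarray
            node = node[digit]
        return node

    def update(self, index: int, value: Any) -> "VCArray":
        root, depth = self.root, self.depth
        while index >= FANOUT ** depth:
            # Grow: the old tree becomes the leftmost daughter of a new root.
            if root is not UNSET:
                root = (root,) + (UNSET,) * (FANOUT - 1)
            depth += 1
        return VCArray(_set(root, _digits(index, depth), value), depth)
```

Each call to update returns a new array and leaves the old one readable. The two versions share every subtree that the update did not touch; only the nodes on the path from the root to the updated element are copied.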
As we saw in that section, an instance object is represented by a molecule, a pair consisting of a skeleton dag (from a rule or lexical entry) and an update environment. We shall now examine the structure of environments.</Paragraph> <Paragraph position="2"> In a chart parser for PATR-II, dag instances in the chart fall into two classes.</Paragraph> <Paragraph position="3"> Base instances are those associated with edges that are created directly from lexical entries or rules.</Paragraph> <Paragraph position="4"> Derived instances occur in edges that result from the combination of a left and a right parent edge containing the left and right parent instances of the derived instance. The left ancestors of an instance (edge) are its left parent and that parent's ancestors, and similarly for right ancestors. I will assume, for ease of exposition, that a derived instance is always a subdag of the unification of its right parent with a subdag of its left parent. This is the case for most common parsing algorithms, although more general schemes are possible [7].</Paragraph> <Paragraph position="5"> If the original Boyer-Moore scheme were used directly, the environment for a derived instance would consist of pointers to left and right parent instances, as well as a list of the updates needed to build the current instance from its parents. As noted before, this method requires a worst-case O(|d|) search to find the updates that result in the current instance.</Paragraph> <Paragraph position="6"> The present scheme relies on the fact that in the great majority of cases no instance is both the left and the right ancestor of another instance. I shall assume for the moment that this is always the case. 
In Section 9 this restriction will be removed.</Paragraph> <Paragraph position="7"> It is a simple observation about unification that an update of a node of an instance I is either an update of I's skeleton or of the value (a subdag of another instance) of another update of I. If we iterate this reasoning, it becomes clear that every update is ultimately an update of the skeleton of a base instance ancestor of I. Since we assumed above that no instance could occur more than once in its derivation, we can therefore conclude that I's environment consists only of updates of nodes in the skeletons of its base instance ancestors. By numbering the base instances of a derivation consecutively, we can then represent an environment by an array of frames, each containing all the updates of the skeleton of a given base instance.</Paragraph> <Paragraph position="8"> Actually, the environment of an instance I will be a branch environment containing not only those updates directly relevant to I, but also all those that are relevant to the instances of I's particular branch through the parsing search space.</Paragraph> <Paragraph position="9"> In the context of a given branch environment, it is then possible to represent a molecule by a pair consisting of a skeleton and the index of a frame in the environment. In particular, this representation can be used for all the values (dags) in updates.</Paragraph> <Paragraph position="10"> More specifically, the frame of a base instance is an array of update records indexed by small integers representing the nodes of the instance's skeleton. An update record is either a list of arc bindings for distinct arc labels or a rerouting update. An arc binding is a pair consisting of a label and a molecule (the value of the arc binding). This represents an addition of an arc with that label and that value at the given node. 
A rerouting update is just a pointer to another molecule; it says that the subdag at that node in the updated dag is given by that molecule (rather than by whatever was in the initial skeleton).</Paragraph> <Paragraph position="11"> To see how skeletons and bindings work together to represent a dag, consider the operation of finding the subdag d/(l_1 ... l_m) of dag d. For this purpose, we use a current skeleton s and a current frame f, given initially by the skeleton and frame of the molecule representing d. Now assume that the current skeleton s and current frame f correspond to the subdag d' = d/(l_1 ... l_{i-1}). To find d/(l_1 ... l_i) = d'/l_i, we use the following method: 1. If the top node of s has been rerouted in f to a dag v, dereference d' by setting s and f from v and repeating this step; otherwise 2. If the top node of s has an arc labeled by l_i with value s', the subdag at l_i is given by the molecule (s', f); otherwise 3. If f contains an arc binding labeled l_i for the top node of s, the subdag at l_i is the value of the binding. If none of these steps can be applied, (l_1 ... l_i) is not a path from the root in d.</Paragraph> <Paragraph position="12"> The details of the representation are illustrated by the example in Figure 6, which shows the passive edges for the chart analysis of the string ab according to the sample grammar. For the sake of simplicity, only the subdags corresponding to the explicit equations in these rules are shown (i.e., the cat dag arcs and the rule arcs 0, 1, ... are omitted). In the figure, the three nonterminal edges (for phrase types S, A and B) are labeled by molecules representing the corresponding dags. The skeleton of each of the three molecules comes from the rule used to build the nonterminal. Each molecule points (via a frame index not shown in the figure) to a frame in the branch environment. 
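The lookup method given in this section can be sketched in Python. The sketch makes simplifying assumptions that are not in the original: the class and function names (Node, Reroute, dereference, subdag) are hypothetical, leaf values are omitted, and the environment is an ordinary list of dictionaries where the scheme described here would use virtual-copy arrays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Node:
    """A skeleton node; ident indexes the update records of a frame."""
    ident: int
    arcs: Dict[str, "Node"] = field(default_factory=dict)


Molecule = Tuple[Node, int]            # (skeleton node, frame index)


@dataclass
class Reroute:
    """Rerouting update: the node is replaced by the target molecule."""
    target: Molecule


ArcBindings = Dict[str, Molecule]      # arc label to value molecule
Frame = Dict[int, Union[Reroute, ArcBindings]]
Environment = List[Frame]


def dereference(mol: Molecule, env: Environment) -> Molecule:
    """Step 1: follow rerouting updates until an unrerouted node is reached."""
    node, frame_index = mol
    while isinstance(env[frame_index].get(node.ident), Reroute):
        node, frame_index = env[frame_index][node.ident].target
    return node, frame_index


def subdag(mol: Molecule, path: List[str],
           env: Environment) -> Optional[Molecule]:
    """Return the molecule reached by path, or None if path is not in the dag."""
    for label in path:
        node, frame_index = dereference(mol, env)
        if label in node.arcs:                    # step 2: skeleton arc
            mol = (node.arcs[label], frame_index)
            continue
        bindings = env[frame_index].get(node.ident, {})
        if label not in bindings:                 # no step applies
            return None
        mol = bindings[label]                     # step 3: arc binding
    return mol
```

For a molecule m and an environment env, subdag(m, ["c", "d"], env) follows the path c, d from the top node of m. At each label it dereferences first, then consults the skeleton, and only then the arc bindings in the frame.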
The frames for the A and B edges contain arc bindings for the top nodes of the respective skeletons, whereas the frame for the S edge reroutes nodes 1 and 2 of the S rule skeleton to the A and B molecules respectively.</Paragraph> </Section> <Section position="9" start_page="141" end_page="142" type="metho"> <SectionTitle> 7 The Unification Algorithm </SectionTitle> <Paragraph position="0"> I shall now give the unification algorithm for two molecules (dags) in the same branch environment.</Paragraph> <Paragraph position="1"> We can treat a complex dag d as a partial function from labels to dags that maps the label on each arc leaving the top node of the dag to the dag at the end of that arc. This allows us to define the following two operations between dags:</Paragraph> <Paragraph position="3"> It is clear that dom(d_1 ◁ d_2) = dom(d_2 ◁ d_1).</Paragraph> <Paragraph position="4"> We also need the notion of dag dereferencing introduced in the last section. As a side effect of successive unifications, the top node of a dag may be rerouted to another dag whose top node will also end up being rerouted. Dereferencing is the process of following such chains of rerouting pointers to reach a dag that has not been rerouted.</Paragraph> <Paragraph position="5"> The unification of dags d_1 and d_2 in environment e consists of the following steps: 1. Dereference d_1 and d_2. 2. If d_1 and d_2 are identical, the unification is immediately successful; otherwise 3. If d_1 is a leaf, add to e a rerouting from the top node of d_1 to d_2; otherwise 4. If d_2 is a leaf, add to e a rerouting from the top node of d_2 to d_1; otherwise 5. If d_1 and d_2 are complex dags, for each arc (l, d) ∈ d_1 ◁ d_2 unify the dag d with the dag d' of the corresponding arc (l, d') ∈ d_2 ◁ d_1. Each of those unifications may add new bindings to e. 
If this unification of subdags is successful, all the arcs in d_1 \ d_2 are entered in e as arc bindings for the top node of d_2 and finally the top node of d_1 is rerouted to d_2.</Paragraph> <Paragraph position="6"> If none of the conditions above applies, the unification fails.</Paragraph> <Paragraph position="7"> To determine whether a dag node is a leaf or complex, both the skeleton and the frame of the corresponding molecule must be examined. For a dereferenced molecule, the set of arcs leaving a node is just the union of the skeleton arcs and the arc bindings for the node. For this to make sense, the skeleton arcs and arc bindings for any molecule node must be disjoint. The interested reader will have no difficulty in proving that this property is preserved by the unification algorithm and therefore all molecules built from skeletons and empty frames by unification will satisfy it. 8 Mapping dags onto virtual-copy memory As we saw above, any dag or set of dags constructed by the parser is built from just two kinds of material: (1) frames; (2) pieces of the initial skeletons from rules and lexical entries. The initial skeletons can be represented trivially by host language data structures, as they never change. Frames, though, are always being updated. A new frame is born with the creation of an instance of a rule or lexical entry when the rule or entry is used in some parsing step (uses of the same rule or entry in other steps beget their own frames). A frame is updated when the instance it belongs to participates in a unification.</Paragraph> <Paragraph position="8"> During parsing, there are in general several possible ways of continuing a derivation. These correspond to alternative ways of updating a branch environment. In abstract terms, on coming to a choice point in the derivation with n possible continuations, n - 1 copies of the environment are made, giving n environments, namely one for each alternative. In fact, 
the use of virtual-copy arrays for environments and frames renders this copying unnecessary, so each continuation path performs its own updating of its version of the environment without interfering with the other paths. Thus, all unchanged portions of the environment are shared.</Paragraph> <Paragraph position="9"> In fact, derivations as such are not explicit in a chart parser. Instead, the instance in each edge has its own branch environment, as described previously. Therefore, when two edges are combined, it is necessary to merge their environments. The cost of this merge operation is at most the same as the worst-case cost for unification proper (O(|d| log |d|)). However, in the very common case in which the ranges of frame indices of the two environments do not overlap, the merge cost is only O(log |d|).</Paragraph> <Paragraph position="10"> To summarize, we have sharing at two levels: the Boyer-Moore style dag representation allows derived dag instances to share input data structures (skeletons), and the virtual-copy array environment representation allows different branches of the search space to share update records.</Paragraph> </Section> <Section position="10" start_page="142" end_page="142" type="metho"> <SectionTitle> 9 The Renaming Problem </SectionTitle> <Paragraph position="0"> In the foregoing discussion of the structure-sharing method, I assumed that the left and right ancestors of a derived instance were disjoint. In fact, it is easy to show that the condition holds whenever the grammar does not allow empty derived edges.</Paragraph> <Paragraph position="1"> In contrast, it is possible to construct a grammar in which an empty derived edge with dag D is both a left and a right ancestor of another edge with dag E. Clearly, the two uses of D as an ancestor of E are mutually independent and the corresponding updates have to be segregated. In other words, we need two copies of the instance D. 
By analogy with theorem proving, I call this the renaming problem.</Paragraph> <Paragraph position="2"> The current solution is to use real copying to turn the empty edge into a skeleton, which is then added to the chart. The new skeleton is then used in the normal fashion to produce multiple instances that are free of mutual interference.</Paragraph> </Section> </Paper>