File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/65/c65-1020_intro.xml

Size: 22,206 bytes

Last Modified: 2025-10-06 14:04:14

<?xml version="1.0" standalone="yes"?>
<Paper uid="C65-1020">
  <Title>CONSTRUCTION CODE ASSOCIATED WITH M,</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
ENDOCENTRIC CONSTRUCTIONS AND THE
COCKE PARSING LOGIC
</SectionTitle>
    <Paragraph position="0"> Automatic sentence structure determination (SSD) is greatly simplified if, through the intervention of a parsing logic, the grammatical rules that determine the structure are partially disengaged from the computer routines that apply them. Some earlier parsing programs analyzed sentences by routines that branched according to the grammatical properties or signals encountered at particular points in the sentence, making the routines themselves serve as the rules. This not only required separate programs for each language, but led to extreme proliferation in the routines, requiring extensive rewriting and debugging with every discovery and incorporation of a new grammatical feature. More recently, programs for SSD have employed generalized parsing logics, applicable to different languages and providing primarily for an exhaustive and systematic application of a set of rules. (1,2,5,5) The rules themselves can be changed without changing the routines that apply them, and the routines consequently take fuller advantage of the speed with which digital computers can repeat the same sequence of instructions over and over again, changing only the values of some parameters at each cycle.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Robinson S
</SectionTitle>
      <Paragraph position="0"> The case in point is the parsing logic (PL) devised by John Cocke in 1960, for applying the rules of a context-free phrase structure grammar (PSG), requiring that each structure recognized by the grammar be analyzed into two and only two immediate constituents (ICs).(1) Although all PSGs appear to be inadequate in some important respects to the task of handling natural language, they still form the base of the more powerful transformational grammars, which are not yet automated for SSD. Moreover, even their severest critic acknowledges that "The PSG conception of grammar...is a quite reasonable theory of natural language which unquestionably formalizes many actual properties of human language."(6, p. 78) Both theoretically and empirically the development and automatic application of PSGs are of interest to linguists.</Paragraph>
      <Paragraph position="1"> The PSG on which the Cocke PL operates is essentially a table of constructions. Its rules have three entries: one for the code (a descriptor) of the construction, the other two specifying the codes of the ordered pair of immediate constituents out of which it may be formed.</Paragraph>
      <Paragraph position="2"> The logic iterates in five nested loops, controlled by three simple parameters and two codes supplied by the grammar. They are: 1) the string length, starting with length 2, of the segment being tested for constructional status; 2) the position of the first word in the tested string; 3) the length of the first constituent; 4) the codes of the first constituent; and 5) the codes of the second constituent.</Paragraph>
      <Paragraph position="3"> After a dictionary lookup routine has assigned grammar codes to all the occurrences in the sentence or total string to be parsed (it need not be a sentence), the PL operates to offer the codes of pairs of adjacent segments to a parsing routine that tests their connectability by looking them up in the stored table of constructions, i.e., in the grammar. If the ordered pair is matched by a pair of ICs in the table, the code of the construction formed by the ICs is added to the list of codes to be offered for testing when iterations are performed on longer strings. In the RAND program for parsing English, the routines produce a labeled binary-branching tree for every complete structural analysis. There will be one tree if the grammar recognizes the string as well-formed and syntactically unambiguous; more than one if it is recognized as ambiguous. Even if no complete analysis is made of the whole string, a résumé lists all constructions found in the process, including those which failed of inclusion in larger constructions. (8,9) *This interaction between a PL and a routine for testing the connectability of two items is described in somewhat greater detail in Hays (2).</Paragraph>
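The five nested loops and the table lookup described above can be sketched in a few lines of Python. This is only an illustrative reconstruction, not the RAND program: the function name cocke_parse, the dictionary representation of the grammar, the zero-based word positions, and the sample codes A, B, C, E, F, H are all assumptions made for the sketch.

```python
def cocke_parse(codes, grammar):
    """Sketch of the Cocke parsing logic (hypothetical data layout).

    codes   -- one list of grammar codes per word, as a dictionary lookup might supply
    grammar -- dict mapping (first IC code, second IC code) to a list of construction codes
    Returns chart[(start, length)] = list of (code, first_len, first_code, second_code).
    """
    n = len(codes)
    chart = {}
    for start, word_codes in enumerate(codes):            # lookup pass, string length 1
        chart[(start, 1)] = [(c, None, None, None) for c in word_codes]
    for length in range(2, n + 1):                        # loop 1: string length
        for start in range(n - length + 1):               # loop 2: position of first word
            found = []
            for first_len in range(1, length):            # loop 3: length of first constituent
                firsts = chart[(start, first_len)]
                seconds = chart[(start + first_len, length - first_len)]
                for p in firsts:                          # loop 4: codes of first constituent
                    for q in seconds:                     # loop 5: codes of second constituent
                        for m in grammar.get((p[0], q[0]), []):
                            found.append((m, first_len, p[0], q[0]))
            chart[(start, length)] = found
    return chart


# Hypothetical three-word example: two rules happen to yield the same code H.
toy_grammar = {("A", "B"): ["E"], ("B", "C"): ["F"], ("A", "F"): ["H"], ("E", "C"): ["H"]}
toy_chart = cocke_parse([["A"], ["B"], ["C"]], toy_grammar)
print(toy_chart[(0, 3)])
```

Every entry keeps the split point and the two constituent codes, so a labeled binary tree could be read back from the chart; the entries for strings that never enter a larger construction remain in the chart, much as in the résumé mentioned above.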
      <Paragraph position="4"> Besides simplifying the problem of revising the grammar by separating it from the problem of application to sentences, the PL, because it leads to an exhaustive application of the rules, permits a rigorous evaluation of the grammar's ability to assign structures to sentences and also reveals many unsuspected yet legitimate ambiguities in those sentences.(4, 7) But because of the difficulties inherent in specifying a sufficiently discriminatory set of rules for sentences of any natural language and because of the very many syntactic ambiguities, resolvable only through larger context, this method of parsing produces a long list of intermediate constructions for sentences of even modest length, and this in turn raises a storage problem.</Paragraph>
      <Paragraph position="5"> By way of illustration, consider a string of four occurrences, x1 x2 x3 x4, a dictionary that assigns a single grammar code to each, and a grammar that assigns a unique construction code to every different combination of adjacent segments. Given such a grammar, as in Table I, the steps in its application to the string by the parsing routines operating with the Cocke PL are represented in Table II. (The preliminary dictionary lookup assigning the original codes to the occurrences is treated as equivalent to iterating with the parameter for string length set to 1.)</Paragraph>
      <Paragraph position="6"> Dictionary lookup assigning codes to x1, x2, x3, x4:
1. 1 1 1 A A (x1)
2. 1 2 1 B B (x2)
3. 1 3 1 C C (x3)
4. 1 4 1 D D (x4)</Paragraph>
      <Paragraph position="8"/>
      <Paragraph position="10"> Code of second constituent string; code for string, to be stored when C(P) and C(Q) are matched in the grammar.</Paragraph>
      <Paragraph position="12"> The boxed section represents the PL iterations.</Paragraph>
      <Paragraph position="13"> With such a grammar, the number of constructions to be stored and processed through each cycle increases in proportion to the cube of the number of words in the sentence. If the dictionary and grammar assign more than one code to occurrences and constructions, the number may grow multiplicatively, making the storage problem still more acute. For example, if x1 were assigned two codes instead of one, additional steps would be required for every string in which x1 was an element and iteration on string length 4 would require twice as many cycles and twice as much storage.</Paragraph>
      <Paragraph position="14"> Of course, reasonable grammars do not provide for combining every possible pair of adjacent segments into a construction, and in actual practice the growth of the construction list is reduced by failure to find the two codes presented by the PL when the grammar is consulted. If Rule 1 is omitted from the grammar in Table I, then steps 5, 9, 14, and 16 will disappear from Table II and both storage requirements and processing time will be cut down. Increasing the discriminatory power of the grammar through refining the codes so that the first occurrence must belong to class Aa and the second to class Bb in order to form a construction provides this limiting effect in essentially the same way.</Paragraph>
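The step counts in this illustration can be checked with a small sketch. It assumes, since Table I is not reproduced here, that the omitted rule is the one combining the codes A and B of the first two occurrences; the function name count_steps and the use of the constituent pair itself as the unique construction code are likewise assumptions of the sketch.

```python
def count_steps(word_codes, allowed):
    """Count every code stored, lookup steps included, when each accepted
    pair of adjacent segments receives its own construction code."""
    n = len(word_codes)
    chart = {(i, 1): [c] for i, c in enumerate(word_codes)}
    steps = n                                     # one lookup step per occurrence
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            cell = []
            for first_len in range(1, length):
                for p in chart[(start, first_len)]:
                    for q in chart[(start + first_len, length - first_len)]:
                        if allowed(p, q):
                            cell.append((p, q))   # the pair stands in for a unique code
            chart[(start, length)] = cell
            steps += len(cell)
    return steps


def every_pair(p, q):
    return True                                   # grammar combining every adjacent pair


def without_rule_1(p, q):
    return (p, q) != ("A", "B")                   # assumed content of the omitted rule


print(count_steps("ABCD", every_pair))        # 16
print(count_steps("ABCD", without_rule_1))    # 12
```

Under these assumptions the full grammar yields sixteen steps for the four-word string, and dropping the one rule removes four of them, because every construction built on the missing pair disappears as well.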
      <Paragraph position="15"> Another way of limiting the growth of the stored constructions is to take advantage of the fact that in actual grammars two or more different pairs of constituents sometimes combine to produce the "same" construction. Assume that A and F (Table I) combine to form a construction whose syntactic properties are the same, at least within the discriminatory powers of the grammar, as those of the construction formed by E and C. Then Rules 4 and 5 can assign the same code, H, to their constructions. In consequence, at both steps 8 and 9 in the parsing (Table II), H will be stored as the construction code C(M) for the string x1 x2 x3, even though two substructures are recorded for it: i.e., (x1(x2 + x3)) and ((x1 + x2)x3). The string can be marked as having more than one structure, but in subsequent iterations on string length 4, only one concatenation of the string with x4 need be made and step 16 can be omitted. When the parsing has terminated, all substructures of completed analyses are recoverable, including those of marked strings.</Paragraph>
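One way to realize this marking is to key the codes stored for a string by the code itself and to append each new substructure to that key. The sketch below is an assumption about data layout, not the RAND routine; the name store and the code H are hypothetical.

```python
def store(cell, code, first_len, first_code, second_code):
    """Record one substructure under `code` for a given string.

    Returns True only when the code is new for this string, i.e. only when it
    has to be offered to later iterations on longer strings.
    """
    is_new = code not in cell
    cell.setdefault(code, []).append((first_len, first_code, second_code))
    return is_new


cell = {}                                   # codes stored for the string x1 x2 x3
offered = [
    store(cell, "H", 1, "A", "F"),          # x1 + (x2 x3)
    store(cell, "H", 2, "E", "C"),          # (x1 x2) + x3
]
print(offered)                              # [True, False]
print(len(cell["H"]))                       # 2
```

A string whose code carries more than one recorded substructure is thereby marked as having more than one structure, while the code itself is passed on only once.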
      <Paragraph position="16"> Eliminating duplicate codes for the same string from the cycles of the PL results in dramatic savings in time and storage, partly because the elimination of any step has a cumulative effect, as demonstrated previously. In addition, opportunities to eliminate duplicates arise frequently, in English at least, because of the frequent occurrence of endocentric constructions, constructions whose syntactic properties are largely the same as those of one of their elements--the head. In English, noun phrases are typically endocentric, and when a noun head is flanked by attributives as in a phrase consisting of article, noun, prepositional phrase (A N PP), the requirement that constructions have only two ICs promotes the assignment of two structures, (A(N+PP)) and ((A+N) PP), unless the grammar has been carefully formulated to avoid it. Since NPs of this type are ubiquitous, occurring as subjects, objects of verbs, and objects of prepositions, duplicate codes for them are likely to occur at several points in a sentence.</Paragraph>
      <Paragraph position="17"> Consideration of endocentric constructions, however, raises other questions, some theoretical and some practical, suggesting modification of the grammar and the parsing routines in order to represent the language more accurately or in order to save storage, or both. Theoretically, the problem is the overstructuring of noun phrases by the insistence on two ICs and the doubtful propriety of permitting more than one way of structuring them.</Paragraph>
      <Paragraph position="18"> Practically, the problem is the elimination of duplicate construction codes stored for endocentric phrases when the codes are repeated for different string lengths.</Paragraph>
      <Paragraph position="19"> Consider the noun phrase subject in All the old men on the corner stared. Its syntactic properties are essentially the same as those of men. But fifteen other phrases, all made up from the same elements but varying in length, also have the same properties. They are shown below: All the old men on the corner; The old men on the corner; All the men on the corner; All old men on the corner; Old men on the corner; The men on the corner; All men on the corner. A reasonably good grammar should provide for the recognition of all sixteen phrases. This is not to say that sixteen separate rules are required, although this would be one way of doing it. Minimally, the grammar must provide two rules for an endocentric NP, one to combine the head noun or the string containing it with a preceding attributive and another to combine it with a following attributive. The codes for all the resulting constructions may be the same, but even so, the longest phrase will receive four different structural assignments or bracketings as its adjacent elements are gathered together in pairs: (all (the (old (men (on the corner))))); (all (the ((old men) (on the corner)))); (all ((the (old men)) (on the corner))); and ((all (the (old men))) (on the corner)). If it is assumed that the same code, say that of a plural NP, has been assigned at each string length, it is true that only one additional step is needed to concatenate the string with the following verb when the PL iteration is performed for string length 8. But meanwhile a number of intermediate codes have been stored during iterations on string lengths 5, 6, and 7 as the position of the first word of the tested string was advanced, so that the list also contains codes for: men on the corner stared (length 5); old men on the corner stared (length 6); and the old men on the corner stared (length 7). Again, the codes may be the same, but duplicate codes will not be eliminated from processing if they are associated with different strings, and strings of different length are treated as wholly different by the PL, regardless of overlap. If this kind of duplication is to be reduced or avoided, a different procedure is required from that available for the case of simple duplication over the same string.</Paragraph>
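The four bracketings follow mechanically from the two minimal rules, one attaching an attributive on the left and one on the right. A short sketch can enumerate them; the function name bracketings is hypothetical, and the prepositional phrase is treated as a single ready-made unit, which is a simplification.

```python
def bracketings(left, head, right):
    """Enumerate binary bracketings of a head with its attributives.

    left  -- attributives preceding the head, outermost first
    right -- attributives following the head, innermost first
    """
    if not left and not right:
        return [head]
    results = []
    if left:    # the last combination made attaches the leftmost attributive
        for inner in bracketings(left[1:], head, right):
            results.append("(%s %s)" % (left[0], inner))
    if right:   # the last combination made attaches the rightmost attributive
        for inner in bracketings(left, head, right[:-1]):
            results.append("(%s %s)" % (inner, right[-1]))
    return results


for b in bracketings(["all", "the", "old"], "men", ["(on the corner)"]):
    print(b)
```

With three attributives on the left and one on the right the sketch prints four bracketings, one for each point in the leftward build-up at which the prepositional phrase may be attached.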
      <Paragraph position="20"> But first a theoretical question must be decided.</Paragraph>
      <Paragraph position="21"> Is the noun phrase, as exemplified above, perhaps really four-ways ambiguous and do the four different bracketings correlate systematically with four distinct interpretations or assignments of semantic structure? (Cf. 4,7) And if so, is it desirable to eliminate them? It is possible to argue that some of the different bracketings do correspond to different meanings or emphases, or--in earlier transformational terms--to different orderings in the embeddings of "the men were old" and "the men were on the corner" into "all the men stared." Admittedly the native speaker can indicate contrasts in meaning by his intonation, emphasizing in one reading that all the men stared and in another that it was all the old men who stared; and the writer can resort to italics. But it seems reasonable to assume that there is a normal intonation for the unmarked and unemphatic phrase and that its interpretation is structurally unambiguous. In the absence of italics and other indications, it seems unreasonable to produce four different bracketings at every encounter with an NP of the kind exemplified.</Paragraph>
      <Paragraph position="22"> One way to reduce the duplication is to write the grammar codes so that, with the addition of each possible element, the noun head is assigned a different construction code whose distribution as a constituent in larger constructions is carefully limited. For the sake of simplicity, assume that the elements of NPs have codes that reflect, in part, their ordering within the phrase and that the NP codes themselves reflect the properties of the noun head in first position and are subsequently differentiated by codes in later positions that correspond to those of the attributes. Let the codes for the elements be 1 (all), 2 (the), 3 (old), 4 (men), 5 (on the corner). Rules may be written to restrict the combinations, as follows: (all the old men); (all men on the corner), but not *41 + 5 / 415; (the men on the corner), but not *42 + 5 / 425; (old men on the corner), but not *43 + 5 / 435; (the old men on the corner), but not *423 + 5 / 4235; (all the old men on the corner), but not *4123 + 5 / 41235. With these rules, the grammar provides for only one structural assignment to the string: (all (the (old (men + on the corner)))).</Paragraph>
      <Paragraph position="23"> This method has the advantage of acknowledging the general endocentricity of the NP while allowing for its limitations, so that where the subtler differences among NPs are not relevant, they can be ignored by ignoring certain positions of the codes, and where they are relevant, the full codes are available. The method should lend itself quite well to code matching routines for connectability. However, if carried out fully and consistently, it greatly increases the length and complexity of both the codes and the rules, and this may also be a source of problems in storage and processing time. (cf. Hays, 2) Another method is to make use of a classification of the rules themselves. Since the lowest loop of the PL (see Fig. 1) iterates on the codes of the second constituents, the rules against which the paired strings are tested are stored as ordered by first IC codes and subordered by second IC codes. If the iterations of the logic were differently ordered, the rules would also be differently ordered, for efficiency in testing. In other words, the code of one constituent in the test locates a block of rules within which matches for all the codes of the other constituent are to be sought; but the hierarchy of ordering by one constituent or the other is a matter of choice so long as it is the same for the PL and for storing the table of rules that constitute the grammar. In writing and revising the rules, however, it proves humanly easier if they are grouped according to construction types.</Paragraph>
      <Paragraph position="24"> Accordingly, all endocentric NPs in the RAND grammar are given rule identification tags with an A in first position. Within this grouping, it is natural to subclass the rules according to whether they attach attributives on the right or on the left of the noun head. If properly formalized, this practice can lead to a reduction in the multiple analyses of NPs with fewer rules and simpler codes than those of the previous method.</Paragraph>
      <Paragraph position="25"> As applied to the example, the thirteen rules and five-place codes of Table IV can be reduced to two rules with one-place codes and an additional feature in the rule identification tag.</Paragraph>
      <Paragraph position="26"> *A1 The rules can be written as:  Although the construction codes are less finely differentiated, the analysis of the example will still be unique, and the number of abortive intermediate constructions will be reduced. To achieve this effect, the connectability test routine must include a comparison of the rule tag associated with each C(P) and the rule tags of the grammar. If a rule of type *A is associated with the C(P), that is, if an *A rule assigned the construction code to the string P which is now being tested as a possible first constituent, then no rule of type $A can be used in the current test. For all such rules, there will be an automatic "no match" without checking the second constituent codes. (See Fig. 1.) As a consequence of this restriction, in the final analysis, the noun head will have been combined with all attributives on the right before acquiring any on the left.</Paragraph>
      <Paragraph position="27"> To be sure, the résumé of intermediate constructions will contain codes for old men, the old men, and all the old men, produced in the course of iterations on string lengths 2, 3, and 4, but only one structure is finally assigned to the whole phrase and the intermediate duplications of codes for strings of increasing length will be fewer because of the hiatus at string length 5. Of course, in the larger constructions in which the NP participates, the reduction in the number of stored intermediate constructions will be even greater.</Paragraph>
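The effect of the tag comparison can be tried out in a sketch. The rule tags *A1 and $A2 follow the text, but the one-place codes used here (A for an attributive, N for the noun head and every NP built on it, P for the prepositional phrase) and the function name count_analyses are assumptions, since the rule table itself is not reproduced.

```python
def count_analyses(codes, rules, restrict):
    """Count the analyses found for the whole string.

    codes -- one code per occurrence
    rules -- list of (tag, first IC code, second IC code, construction code)
    With restrict=True, a first constituent built by a * rule may not enter a $ rule.
    """
    n = len(codes)
    chart = {(i, 1): [(c, "")] for i, c in enumerate(codes)}     # (code, tag of building rule)
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            cell = []
            for first_len in range(1, length):
                for p_code, p_tag in chart[(start, first_len)]:
                    for q_code, _ in chart[(start + first_len, length - first_len)]:
                        for tag, first, second, result in rules:
                            if restrict and p_tag.startswith("*") and tag.startswith("$"):
                                continue                         # automatic "no match"
                            if (p_code, q_code) == (first, second):
                                cell.append((result, tag))
            chart[(start, length)] = cell
    return len(chart[(0, n)])


np_rules = [("*A1", "A", "N", "N"),    # attributive + head, attaching on the left
            ("$A2", "N", "P", "N")]    # head + prepositional phrase, attaching on the right
phrase = ["A", "A", "A", "N", "P"]     # all, the, old, men, (on the corner)
print(count_analyses(phrase, np_rules, restrict=False))    # 4
print(count_analyses(phrase, np_rules, restrict=True))     # 1
```

Duplicate codes are deliberately left unmerged in this sketch so that each stored entry corresponds to one analysis; with the restriction in force the count for the whole phrase drops from four to one.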
      <Paragraph position="28"> Provisions may be made in the rules for attaching still other attributives to the head of the NP without great increase in complexity of rules or multiplication of structural analyses. Rule $A2, for example, could include provision for attaching a relative clause as well as a prepositional phrase, and while a phrase like "the men on the corner who were sad" might receive two analyses unless the codes were sufficiently differentiated to prevent the clause from being attached to corner as well as to men, at least the further differentiation of the codes need not also be multiplied in order to prevent the multiple analyses arising from endocentricity.</Paragraph>
      <Paragraph position="29"> Similarly, for verb phrases where the rule must allow for an indefinite number of adverbial modifiers, a single analysis can be obtained by marking the strings and the rules and forcing a combination in a single direction. In short, although the Cocke PL tends to promote multiple analysis of unambiguous or trivially ambiguous endocentric phrases, at the same time increasing the problem of storing intermediate constructions, the number of analyses can be greatly reduced and the storage problem greatly alleviated if the rules of the grammar recognize endocentricity wherever possible and if they are classified so that rules for endocentric constructions are marked as left (*) or right ($), and their order of application is specified.</Paragraph>
      <Paragraph position="30"> A final theoretical-practical consideration can at least be touched on, although it is not possible to develop it adequately here. The foregoing description provided for combining a head with its attributives (or dependents) on the right before combining it with those on the left, but either course is possible. Which is preferable depends on the type of construction and on the language generally. If Yngve's hypothesis that languages are essentially asymmetrical, tending toward right-branching constructions to avoid overloading the memory, is correct, then the requirement to combine first on the right is preferable. (10) This is a purely grammatical consideration, however, and does not affect the procedure sketched above, in principle. For example, consider an endocentric construction of string length 6 with the head at position 3, so that its extension is predominantly to the right, thus: 1 2 (3) 4 5 6. If all combinations were allowed by the rules, there would be thirty-four analyses. If combination is restricted to either direction, left or right, the number of analyses is reduced to eleven. However, if the Cocke PL is used to analyze a left-branching language, making it preferable to specify prior combination on the left, then the order of nesting of the fourth and fifth loops of the PL should be reversed (Fig. 1) and the rules of the grammar should be stored in order of their second constituent codes, subordered on those of the first constituents.</Paragraph>
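The two figures in this example can be approached by a simple count, offered here only as one possible reading. A substring holding the head with l attributives on its left and r on its right has as many unrestricted analyses as there are ways of interleaving the l leftward and r rightward attachments, and exactly one analysis once a direction is imposed. The function name count_analyses_by_substring and both counting conventions are assumptions of the sketch.

```python
from math import comb


def count_analyses_by_substring(left, right, restricted, include_bare_head):
    """Sum the analyses of every substring that contains the head.

    left, right -- number of attributives available on each side of the head
    restricted  -- True if attachment on one side must be completed first
    """
    total = 0
    for l in range(left + 1):
        for r in range(right + 1):
            if l == 0 and r == 0 and not include_bare_head:
                continue
            total += 1 if restricted else comb(l + r, l)
    return total


# Head at position 3 of 6: two attributives on the left, three on the right.
print(count_analyses_by_substring(2, 3, restricted=False, include_bare_head=True))   # 34
print(count_analyses_by_substring(2, 3, restricted=True, include_bare_head=False))   # 11
```

On this reading thirty-four is reached when the bare head is counted as a trivial analysis, and eleven is the number of substrings of two or more occurrences that contain the head. Counting both figures on the same footing would give thirty-three and eleven, or thirty-four and twelve, so the convention intended in the text remains uncertain.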
    </Section>
  </Section>
</Paper>