<?xml version="1.0" standalone="yes"?> <Paper uid="E89-1035"> <Title>THE SYNTACTIC REGULARITY OF ENGLISH NOUN PHRASES</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> THE ANALYSIS TECHNIQUE </SectionTitle> <Paragraph position="0"> A superset of the corpus of data analysed by Sampson (1987a) was extracted from the LOB treebank using tree searching software developed by the first author and Roger Garside of Lancaster University's computing department. Following Sampson, we ignored categories G (Belles lettres, biography, essays) and P (Romance and love story) from the treebank data-base.</Paragraph> <Paragraph position="1"> The omission of this treebank data merely reflects the state of development of the treebank at the time when Sampson undertook his experiment. However, Sampson also ignored coordination because he felt that coordination reduction and such phenomena would create &quot;special complications&quot;. We include results for the coordinated examples because the ANLT grammar contains the required rules. In other respects, the initial samples are identical; both being drawn from an identical 38,212 word sample from the treebank.</Paragraph> <Paragraph position="2"> Of the 10,150 NPs in this sample of the treebank, 17 were rejected because they were incorrectly analysed and either were not, in fact, NPs or else the boundaries of the putative NP were incorrectly marked and, therefore, our access software failed. The remaining 10,133 NPs were initially sorted into single and multi constituent NPs (according to the LOB model of analysis). Single constituent NPs were further sorted according to the incidence and order of their immediate lexical constituents and multi constituent NPs according to the incidence, order and attachment of their immediate daughters. 
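The sorting step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the tag sequences are invented toy data rather than LOB treebank output, and `sort_into_types` is a hypothetical helper, not the tree searching software used in the experiment. Each NP is represented by the ordered tags of its immediate constituents, and NPs with identical sequences fall into the same type.

```python
from collections import Counter

# Hypothetical toy data: each NP is the ordered sequence of tags of its
# immediate constituents. The tags are illustrative, not a real LOB sample.
nps = [
    ("AT", "NN"),
    ("AT", "JJ", "NN"),
    ("AT", "NN"),
    ("PP$", "NN"),
    ("AT", "NNS"),
]


def sort_into_types(np_list):
    """Count how many NP tokens share each exact tag sequence (one NP type)."""
    return Counter(np_list)


types = sort_into_types(nps)
# A singleton type is a type represented by exactly one token.
singletons = [t for t, n in types.items() if n == 1]
print(len(types), "types,", len(singletons), "singleton types")
```

Because the types are keyed on the literal tag sequence, two NPs that differ only in the number of the head noun (NN versus NNS in the toy data) count as distinct types at this stage.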
At this point, we discarded a further 119 NPs which were tagged in a way which indicated they contained either foreign phrases (for example, fait accompli) or mathematical formulae and symbols. These are tagged but not analysed internally in the treebank.</Paragraph> <Paragraph position="3"> We assume that they are irrelevant to the syntax of English NPs. These steps resulted in 10,014 NPs being sorted into 2358 distinct NP types. These types must be identical with those of Sampson's initial analysis (modulo the inclusion of coordination and exclusion of formulae and foreign phrases) because they are based entirely on the literal form of the tags in the LOB treebank.</Paragraph> <Paragraph position="4"> The next stage of our analysis was to semi-automatically reduce these 2358 NP types into fewer types by collapsing together tags on the basis of grammatical generalisations exploited in the ANLT grammar rules and implicit in the LOB tag names. For example, there is no purpose in treating NPs identical apart from the number of the head noun as distinct (although they are tagged distinctly) because the ANLT grammar will deploy precisely the same set of rules to analyse them.</Paragraph> <Paragraph position="5"> Sampson (1987a) also collapsed types by generalising across tags; however, he gives no details of this procedure, so it is impossible to quantify the extent to which our analyses diverged at this point. Following Sampson, we ignored the internal structure of post-modifiers (such as PPs, relative clauses, etc.) and of possessive premodifiers. However, in order not to trivialise the experiment we analysed the same set of lexical data covered by his analysis regardless of whether lexical items are treated as immediate constituents of NP in the ANLT grammar.
For example, sequences of simple adjectival or possessive premodifiers are directly attached to the topmost NP node in the treebank, so we consider these cases in our results.</Paragraph> <Paragraph position="6"> We also performed some manual editing of the LOB examples to remove punctuation. The ANLT grammar contains no rules referring to punctuation since we do not regard punctuation as a syntactic phenomenon.</Paragraph> <Paragraph position="7"> However, where punctuation reflects a genuine syntactic distinction (such as that between restrictive and non-restrictive postmodification), examples were classified appropriately. This approach probably gives us a slight edge over Sampson in terms of the generalising power of our rules, but we do not regard this as pernicious because we do not recognise a syntactic difference between examples such as the man with red shoes in the park and the man with red shoes, in the park, given the semantically intuitive analysis. 48 NPs contained brackets, of which 34 signalled appositional or parenthetical material. The appositional cases were parsed with brackets deleted. The parenthetical cases were counted as failures (see below for further discussion). In 8 of the remaining cases, the brackets were internal to an embedded constituent and were, therefore, irrelevant. 3 further examples contained point numbering or marking (i.e. a)... b)...) conventions and the final 3 enclosed ordinary modifiers. These 6 examples were parsed with brackets and numbering/marking conventions removed.</Paragraph> <Paragraph position="8"> These steps resulted in 707 distinct NP types.</Paragraph> <Paragraph position="9"> Sampson (1987a) found 747 types. When one considers that punctuation will have increased the number of types he found, it seems likely that we have reanalysed the data in a manner quite similar to his original analysis. One token of each of the 707 revised types of NP was parsed using the ANLT grammar NP rules.
Initially, we attempted to perform this analysis automatically using the ANLT project parser in batch mode. The words in the example to be parsed were replaced with their lexical tags and a 'lexicon' was created relating tags to lexical syntactic categories in the ANLT grammar. Data from the treebank and other data from two different corpora were parsed in this fashion and the output was manually analysed to select the semantically correct analysis, weed out 'false positives' where the system had assigned one or more incorrect analyses, and to diagnose the reasons for parse failure.</Paragraph> <Paragraph position="10"> Failures occurred both because of inadequacies in grammatical coverage and because of resource limitations with some long and multiply-ambiguous NPs. The resulting data contained many cases of multiple analyses of the type expected using a grammar containing rules to handle PP attachment and compounding (see, for example, Church &amp; Patil, 1982). The intention was to compute the frequency with which each rule of the grammar applied and the overall success rate of the grammar/parser from these manually edited files. However, the process of evaluating and searching for correct analyses amongst very high numbers of automatically generated parses required more effort than manually applying the rules to check that the semantically correct analysis could be produced. This problem highlights the need for automatic semantic 'filtering' of the parses produced, but, in the absence of a fairly comprehensive and sophisticated lexical and compositional semantic component, this was not possible.</Paragraph> <Paragraph position="11"> Therefore, we completed the analysis of one token of each of the 707 NP types by manually applying the ANLT grammar to check that the semantically appropriate analysis could be produced. When the correct parse was available, the rules used in this analysis were recorded.
We derived a numerical index of the generality of each rule by counting each application and multiplying it by the number of tokens in each type exemplified by the parsed example.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> 622 of the 707 examples were parsed successfully, yielding a success rate of 87.97%. When the success rate takes account of the frequency of each NP type in the sample and indicates the proportion of successful NP parses which would be achieved by the ANLT system for this data, the figure rises to 96.88% or 9702 NPs parsed successfully out of the 10,014 sample.</Paragraph> <Paragraph position="1"> The analyses utilised a total of 54 distinct rules expressed in the ANLT 'object grammar' formalism. Of these, 8 were additions prompted by the experiment: 3 for names (Mr. Joe Bloggs), 1 for noun compounding (water meter), 2 for adverbial pre- and post-modification (nearly a century), 1 for possessive NPs dominated by N-bar (the America's cup), and 1 for NPs with adjectival heads (the poor). We added these rules because they express uncontroversial generalisations and represent 'oversights' in the development of the grammar rather than ad hoc additions solely for the purposes of the experiment.</Paragraph> <Paragraph position="2"> These object grammar rules were produced by 7 linear precedence statements, 4 rules of feature propagation, 6 feature default rules, 3 metarules, and 50 immediate dominance rules in the metagrammar. Although the metagrammar is the 'seat of linguistic generalisations' in our system, parsing proceeds in terms of a compiled object grammar derived from these meta-grammatical statements.
Therefore, statistics concerning rule application will be associated with the object grammar.</Paragraph> <Paragraph position="3"> We counted the number of times each of the 54 object grammar phrase-structure rules would apply in the analysis of all the parsable examples in the sample. The categories of these object grammar rules still contain features with variable values which will be instantiated at parse time by unification. They are therefore considerably more general than similar rules with atomic or nearly-atomic categories (of the kind which are implicit in the treebank analyses and resulting NP types). Table 1 below presents these results. The rules used and their corresponding names are a superset of those described in Grover et al. (1987). Grover et al. (1989) describes in detail all the rules used below.</Paragraph> <Paragraph position="5"> [Table 1, surviving rule glosses:] N conjunct, with coordinator; and coordination of N; or coordination of N, all conjuncts with same PLU value; and coordination of N1; or coordination of N1, all conjuncts PLU; or coordination of N1, all conjuncts PLU +; and coordination of N2; and coordination of N2 but no coordinators (i.e. a list); both ... and coordination of N2; or coordination of N2, all conjuncts PLU; or coordination of N2, differing PLU values; or coordination of N2, all conjuncts PLU +</Paragraph> <Paragraph position="7"> There are a number of reasons why some of these figures are slightly misleading. For example, some low numbers are an artifact of the preliminary analysis into types. Thus, N2+/PRO(FOOT9), which would be utilised to parse NPs consisting of wh-pronouns, such as who, what, and so forth, only applies once. In the preliminary analysis, we decided to collapse together tags for the wh and non-wh version of the same category.
It is just an accident that in all of the representative tokens of each type which were parsed, only one wh-pronoun turned up and this happened to represent a singleton type.</Paragraph> <Paragraph position="8"> Similarly, N1/SFIN only applies twice, but it is probable that there are more examples of nouns taking sentential complements as arguments in the sample. The LOB tagset represents these complements by 'Fn' and relative clauses by 'Fr'. Following Sampson, we collapsed all of these to 'F'. Consequently, the bulk of the sentential complements were incorrectly added to the types involving postmodification by relative clauses. These problems are unavoidable, given the particular assumptions built into the LOB treebank analyses, unless a completely new analysis of the sample was undertaken.</Paragraph> <Paragraph position="9"> One way of ameliorating this problem is to collapse some of the distinct rules in Table 1. A number of the distinct object grammar rules are present for 'technical' reasons connected with the use of fixed-arity unification and feature propagation by variable binding in the ANLT grammar formalism and parser (see Briscoe et al., 1987b,c for details). Therefore, we reduced the 54 object grammar rules to 36 hypothetical rules using our judgement to determine whether a distinction between rules was motivated by a linguistic generalisation or a technical consideration peculiar to the ANLT grammar formalism. In most cases, the linguistic generalisation is, in fact, present in the metagrammar rules but 'compiled out' in the automatic production of the equivalent object grammar. For example, rules with 'FOOT' in their name are wh-variants of other rules defined by metarules which state the manner in which they differ (systematically) from the non-wh versions. The resulting 36 hypothetical rules are given in Table 2 along with new rule application counts based on summing the counts for the merged actual rules. 
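The merging of actual rules into hypothetical rules, with counts summed, can be sketched as follows. This is a minimal illustration only: the counts are invented and are not taken from Table 1, and both the `merge_map` mapping and the helper `merge_counts` are assumptions introduced here rather than part of the ANLT tools.

```python
from collections import defaultdict

# Hypothetical application counts for a few object grammar rules.
# The numbers are invented for illustration, not taken from Table 1.
actual_counts = {"N2+/PRO": 120, "N2+/PRO(FOOT)": 1, "N1/SFIN": 2}

# Assumed mapping from each actual rule to the hypothetical rule it is
# merged into; here a wh-variant folds into its non-wh counterpart.
merge_map = {
    "N2+/PRO": "N2+/PRO",
    "N2+/PRO(FOOT)": "N2+/PRO",
    "N1/SFIN": "N1/SFIN",
}


def merge_counts(counts, mapping):
    """Sum the counts of all actual rules that map to the same merged rule."""
    merged = defaultdict(int)
    for rule, n in counts.items():
        merged[mapping[rule]] += n
    return dict(merged)


merged = merge_counts(actual_counts, merge_map)
print(merged)
```

Keeping the mapping explicit makes each merging decision (linguistic generalisation versus technical distinction) easy to inspect and revise.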
We also give the figures for the number of times each rule applied in the parsing of one token of each type. The final column presents a 'proportioned-up' figure based on multiplying the second column by 15.6 (since the parsed tokens represent 6.41% of the total sample). This column gives another perspective on the 'generalising power' of the rules involved.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> COMPARISON OF RULES AND TYPES </SectionTitle> <Paragraph position="0"> We suggested above that Sampson's argument against the generative concept of grammaticality is based on the assumption that each type in his original analysis will be associated with one rule. Sampson (1987a) found 747 types of which 468 were singleton types containing only one token, or 62.65% singleton types. In our reconstruction of Sampson's analysis we found 707 types of which 421 were singleton types, or 59.55% singleton types. Sampson's commonest type contained 1135 tokens; ours contained 1519 tokens. Sampson (1987a) presents an analysis of his data which involves plotting a frequency-ordered list of NP types against the cumulative frequency of NP tokens in types of the same or lower frequency. This allows him to predict that 'rare' types, defined in terms of rate of occurrence relative to the rate of occurrence of the commonest type, will crop up fairly often in naturally occurring samples of NPs. For instance, if 'rare' is defined as occurring no more than once per 1000 occurrences of the commonest type, then about one example in 16 will represent some rare type.</Paragraph> <Paragraph position="1"> Therefore, a robust parser will need many 'rules' for such 'rare' types.
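The notion of 'rare' types used in this argument can be made concrete with a small calculation. The sketch below assumes one reading of the definition given above (a type is rare if it occurs no more than once per 1000 occurrences of the commonest type); the frequencies are invented toy values, not Sampson's data, and `rare_token_share` is a hypothetical helper.

```python
def rare_token_share(type_counts, per=1000):
    """Share of tokens that belong to 'rare' types.

    A type counts as rare when it occurs no more than once per `per`
    occurrences of the commonest type in the sample.
    """
    commonest = max(type_counts)
    rare = [n for n in type_counts if commonest >= n * per]
    return sum(rare) / sum(type_counts)


# Hypothetical type frequencies, invented for illustration.
toy_counts = [2000, 500, 100, 2, 2, 1, 1, 1, 1]
share = rare_token_share(toy_counts)
print(round(share, 4))
```

With real data the same function would be applied to the token counts of all types in the sample, and the threshold `per` varied to see how the share of rare tokens responds.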
Furthermore, there is no reason to expect the percentage of singleton types to fall as the sample size grows, implying that a robust parser of unrestricted text deploying a finite set of generative rules is out of the question.</Paragraph> <Paragraph position="2"> Unfortunately, we cannot repeat Sampson's analysis for both our types and our rules because more than one rule is involved in the parsing of many of the types.</Paragraph> <Paragraph position="3"> Using the ANLT NP rules, an average of 5 rules applied to each parsed token exemplifying a type; this figure drops to 3.18 when we take the average for the complete sample. Therefore, there is no direct correlation between rules and types. Nevertheless, Sampson's result follows directly from the high proportion of singleton types in his analysis and his assumption that one rule will suffice for each type; as he writes &quot;although a rare type is by definition represented by fewer tokens in a sample than a common type, as we move to lower type-frequencies the number of types possessing those frequencies grows, so that the total proportion of tokens representing all &quot;rare&quot; types remains significantly large even when the threshold of &quot;rarity&quot; is set at relatively extreme values.&quot; (Sampson, 1987:225, original emphasis).</Paragraph> <Paragraph position="4"> The most basic and important difference between any grammar based on a one-to-one correspondence of rules and types and one such as the ANLT grammar is the enormous difference in its size; namely, 36 or 54 rules as opposed to 707 or 747 rules, a reduction by a factor of between 13 and 20 approximately. This alone testifies to the greater generality of the ANLT NP grammar rules. However, there are also big differences in the patterns of application of rules between the two approaches.
We can see this by looking at an ordered list of the rarest 10 types and comparing it with similar lists for the least applied actual and hypothetical 10 ANLT rules. The first column in Table 3 shows the number of tokens or rule applications. Following columns show numbers and percentages of types or rules associated with this number of tokens or applications.</Paragraph> <Paragraph position="5"> Table 3: 10 Least Frequent Types / Least Frequently Applied Rules. Summing the percentage values reveals that 88.92% of tokens fell into the ten rarest types, 38.89% of actual rules fell into the ten least applied classes, and 33.33% of hypothetical rules fell into the ten least applied classes for that set. Table 3 further demonstrates the greater generality of the rule-based analysis versus the type-based analysis for this sample of NPs. But in a sense, presenting the results in this manner misses the crux of Sampson's argument that any parsing system based on generative rules will need a large or open-ended set of spurious 'rules' which simply redescribe the data, because they will only apply once. In the actual rule set, 6 rules or 11.11% are dubious in this sense, but, as we argued above, these rules are only distinct for technical reasons and in the hypothetical set no such rules exist. In any case, the proportion of actual dubious rules represents a considerable improvement on the proportion of singleton types (59.55%).</Paragraph> <Paragraph position="6"> In (1) we present 3 (randomly-chosen) tokens of NPs from singleton types. If Sampson's general thesis were correct, we would expect such examples to be exotic or syntactically mysterious.</Paragraph> <Paragraph position="8"> (1) a) the old tension-bar-sprung Morris Minor; b) the main existing indirect tax, purchase tax; c) a basic ideological one. These NPs are not problematic for the ANLT grammar and are classified as singleton types because of the nature of the lexical and syntactic analysis used in the LOB treebank.
Similarly, ANLT rules which applied 'rarely', such as N1/VPINF (6 times) or N1/INFMOD (2 times), which would apply in the parsing of desire to grow up and man to ask respectively, do not encode controversial or doubtful generalisations, although the actual frequency of such constructions in English may well be low.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> THE FAILURES </SectionTitle> <Paragraph position="0"> It is instructive for similar reasons to examine those examples that the ANLT grammar failed to parse. If Sampson's general thesis were correct, we should expect these to fall into singleton types and be syntactically exotic or mysterious. In fact, they are relatively easy to classify and the failure of the ANLT grammar results from either intentional or in some cases unintentional 'oversights' in the NP grammar. The failures can be classified, as illustrated in Table 4.</Paragraph> <Paragraph position="1"> Odd numbers include examples like 2 Kings 25 : 25 , 6, and so forth. No rule was included in the grammar for dates, although these all consist of day (written 10 or 10th), month (unabbreviated), and year (in numerals). In 2 of the 4 cases the order of day and month is reversed. Ellipsis of the head noun in cases where there is a postmodifier, for example, those who perpetuate it, causes a problem for the ANLT grammar because the determiner those cannot be analysed as a pronoun since the grammar blocks modification of pronouns. This problem accounts for all the failures in this class.</Paragraph> <Paragraph position="2"> Parenthetical or intrusive material which is not in apposition comes in two kinds.
Firstly, there are cases of grammatical modification which occurs between the head noun and its arguments, as (2) illustrates.</Paragraph> <Paragraph position="3"> (2) our failure over two centuries to sustain any strong national musical tradition of our own. These are not parsed as a result of the rigid assumptions about the ordering of arguments and modifiers built into the grammar. These need to be relaxed on the basis of some theory of 'heaviness' and its effect on order.</Paragraph> <Paragraph position="4"> Secondly, there are cases of genuine intrusive interjection or interpolation, as (3) illustrates.</Paragraph> <Paragraph position="5"> (3) little capsules, this big, - he brandished a teaspoon - with hundreds of tiny little red men inside them. Such invasive material can occur in most positions from a syntactic perspective. We suspect that a theory concerning their distribution would be largely pragmatic. Some cases of 'right-node raising' of phrases are covered by the ANLT grammar. However, there is no rule for 'right-node raising' of nouns which would appear to be needed in NPs such as late 19th- and early 20th-century Rumania. Similarly, the grammar restricts NP premodifiers to AP, but a number of non-AP premodifiers occurred in the sample. These mostly involved measure phrases of some form, such as a 6 p.c tax free distribution, the 24ft passenger cabin, or the 5 shilling shares. There are 4 cases of unlike category coordination in AP modifiers like music both manuscript and printed and wine-glass or flared heels.</Paragraph> <Paragraph position="6"> The ANLT grammar allows this in post-copular position,</Paragraph> <Paragraph position="7"> but clearly the relevant generalisations should be extended to AP pre- and post-modifiers.</Paragraph> <Paragraph position="8"> There are a number of cases where a premodifier selects a particular postmodifier. Comparative constructions with more and than are a well-known type which the ANLT grammar covers.
However, there are many other more or less idiomatic phrases of this type, some of which could probably be subsumed by an expanded treatment of comparatives along existing lines, some of which could not. We give illustrative examples in (4).</Paragraph> <Paragraph position="9"> (4) such a crazy spin that Leslie could not cope with it; as much God's handiwork as a man; as little as 0.001 at % of the addition elements. In addition, the rule for noun compounding we have included does not allow compounds to contain anything other than lexical nouns. Cases of adjectives in compounds were treated as 'successes' by allowing the rule N/ADJ, which converts adjectives such as poor to nouns to deal with ellipsis of the head noun in the poor, to overapply to adjectives in compounds. In this area, the ANLT grammar is clearly inadequate and needs improvement in obvious directions. The rule N/ADJ should be replaced by a lexical rule which states that '+human' adjectives can function as nouns, and compounding rules should be allowed to cross the 'boundary' between morphology and syntax, perhaps by allowing N-bar categories as well as nouns to 'compound'. These modifications would allow the illustrative examples in (5) to be counted as successes. (5) the third geologists' association excursion; our well organised after care departments. The miscellaneous class contains 2 types where each occurs at the NP boundary, such as silicon, copper and magnesium each. We suspect that in these examples each should be treated as an adverbial modifier of the following VP. There are two types containing the phrase all but as part of a partitive, some cases of words such as no one occurring unhyphenated, and one or two more exotic examples illustrated in (6).</Paragraph> <Paragraph position="10"> (6) in 17 something Newton discovered gravity; 'a man on the roof' by Kathleen Sully, Peter Davies, 15 shillings. A final example worthy of consideration is given in (7).
(7) the company's Caravelle schedules London-Brussels and onwards from Athens to various points...</Paragraph> <Paragraph position="11"> This could be classified as a case of non-constituent coordination of NP and PP postnominally or as a case of specialised ellipsis of from before London in 'travel-agent-speak'.</Paragraph> </Section></Paper>