<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1115">
  <Title>Compacting the Penn Treebank Grammar</Title>
  <Section position="4" start_page="699" end_page="699" type="metho">
    <SectionTitle>
3 Rule Growth and Partial
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="699" end_page="699" type="sub_section">
      <SectionTitle>
Bracketting
</SectionTitle>
      <Paragraph position="0"> Why should the set of rules continue to grow in this way? Putting aside the possibility that natural languages do not have finite rule sets, we can think of two possible answers. First, it may be that the full &amp;quot;underlying grammar&amp;quot; is much larger than the rule set that has so far been produced, requiring a much larger tree-banked corpus than is now available for its extraction. If this were true, then the outlook would be bleak for achieving near-complete grammars from treebanks, given the resource demands of producing hand-parsed text. However, the radical incompleteness of grammar that this alternative implies seems incompatible with the promising parsing results that Charniak reports (Charniak, 1996).</Paragraph>
      <Paragraph position="1"> A second answer is suggested by the presence in the extracted grammar of rules such as (1). 2 This rule is suspicious from a linguistic point of view, and we would expect that the text from which it has been extracted should more properly have been analysed using rules (2,3), i.e. as a coordination of two simpler NPs.</Paragraph>
      <Paragraph position="3"> Our suspicion is that this example reflects a widespread phenomenon of partial bracketting within the PTB. Such partial bracketting will arise during the hand-parsing of texts, with (human) parsers adding brackets where they are confident that some string forms a given constituent, but leaving out many brackets where they are less confident of the constituent structure of the text. This will mean that many rules extracted from the corpus will be 'flatter' than they should be, corresponding properly to what should be the result of using several grammar rules, showing only the top node and leaf nodes of some unspecified tree structure (where the 'leaf nodes' here are category symbols, which may be nonterminal). For the example above, a tree structure that should properly have been given as (4), has instead received</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="699" end_page="699" type="metho">
    <SectionTitle>
4 Grammar Compaction
</SectionTitle>
    <Paragraph position="0"> The idea of partiality of structure in treebanks and their grammars suggests a route by which treebank grammars may be reduced in size, or compacted as we shall call it, by the elimination of partial-structure rules. A rule that may be eliminable as a partial-structure rule is one that can be 'parsed' (in the familiar sense of context-free parsing) using other rules of the grammar. For example, the rule (1) can be parsed using the rules (2,3), as the structure (4) demonstrates. Note that, although a partial-structure rule should be parsable using other rules, it does not follow that every rule which is so parsable is a partial-structure rule that should be eliminated. There may be defensible rules which can be parsed. This is a topic to which we will return at the end of the paper (Sec. 6). For most of what follows, however, we take the simpler path of assuming that the parsability of a rule is not only necessary, but also sufficient, for its elimination.</Paragraph>
    <Paragraph position="1"> Rules which can be parsed using other rules in the grammar are redundant in the sense that eliminating such a rule will never have the effect of making a sentence unparsable that could previously be parsed. (Footnote 3: wherever a sentence has a parse P that employs the parsable rule R, it also has a further parse that is just like P except that any use of R is replaced by a more complex substructure, i.e. a parse of R.) The algorithm we use for compacting a grammar is straightforward. A loop is followed whereby each rule R in the grammar is addressed in turn. If R can be parsed using other rules (which have not already been eliminated), then R is deleted (and the grammar without R is used for parsing further rules). Otherwise R</Paragraph>
    <Paragraph position="2"> is kept in the grammar. The rules that remain when all rules have been checked constitute the compacted grammar.</Paragraph>
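The elimination loop just described can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the rule representation, the chart-based parsability check, and the example rules (a flat NP rule parsed by a coordination rule plus a simpler NP rule, in the spirit of rules (1)-(3), which are not reproduced in this extraction) are all assumptions. It also presupposes, as the paper's extracted grammar does, that there are no unary or epsilon rules.

```python
def tiles(body, i, j, chart):
    """Can the category sequence `body` tile the span [i, j) of the chart?"""
    if not body:
        return i == j
    head, rest = body[0], body[1:]
    return any(head in chart[i][k] and tiles(rest, k, j, chart)
               for k in range(i + 1, j + 1))

def can_parse(rule, others):
    """CKY-style check: can the right-hand side of `rule` be parsed as its
    left-hand side using only the rules in `others`?  Leaf cells contain the
    rhs categories themselves (which may be nonterminals)."""
    lhs, rhs = rule
    n = len(rhs)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i in range(n):
        chart[i][i + 1].add(rhs[i])
    for width in range(2, n + 1):        # no unary/epsilon rules assumed
        for i in range(n - width + 1):
            j = i + width
            for a, body in others:
                if tiles(body, i, j, chart):
                    chart[i][j].add(a)
    return n > 1 and lhs in chart[0][n]

def compact(rules):
    """Delete every rule that is parsable by the rules not yet eliminated."""
    kept = list(rules)
    for rule in list(kept):
        rest = [r for r in kept if r != rule]
        if can_parse(rule, rest):
            kept = rest
    return kept

# Hypothetical rules in the spirit of (1)-(3): a flat NP rule that a
# coordination rule plus a simpler NP rule can parse.
flat   = ('NP', ('DT', 'NN', 'CC', 'DT', 'NN'))
coord  = ('NP', ('NP', 'CC', 'NP'))
simple = ('NP', ('DT', 'NN'))
```

Running `compact([flat, coord, simple])` eliminates the flat rule and leaves the other two, mirroring the elimination of a rule like (1) by rules like (2,3).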
    <Paragraph position="3"> An interesting question is whether the result of compaction is independent of the order in which the rules are addressed. In general, this is not the case, as is shown by the following rules, of which (8) and (9) can each be used to parse the other, so that whichever is addressed first will be eliminated, whilst the other will remain.</Paragraph>
    <Paragraph position="5"> Order-independence can be shown to hold for grammars that contain no unary or epsilon ('empty') rules, i.e. rules whose righthand sides have one or zero elements. The grammar that we have extracted from PTB II, and which is used in the compaction experiments reported in the next section, is one that excludes such rules. For further discussion, and for the proof of order independence, see (Krotov, 1998). Unary and epsilon rules were collapsed with the sister nodes, e.g. the structure (S (NP -NULL-) (VP</Paragraph>
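The paper's mutually parsable rules (8) and (9) are not reproduced in this extraction, but the order dependence they illustrate can be sketched with hypothetical unary rules (precisely the kind the extracted PTB grammar excludes, which is why compaction on that grammar is order-independent). Here parsability reduces to reachability over the remaining unary rules; the rule set is an invented example, not the paper's.

```python
def parsable_unary(rule, others):
    """A unary rule lhs -> sym is parsable if lhs derives sym in one or
    more steps using only the other (unary) rules."""
    lhs, sym = rule
    reached, frontier = set(), {lhs}
    while frontier:
        step = {b for a, b in others if a in frontier} - reached
        reached |= step
        frontier = step
    return sym in reached

def compact(rules):
    """Eliminate each rule, in order, that the surviving rules can parse."""
    kept = list(rules)
    for rule in list(kept):
        rest = [r for r in kept if r != rule]
        if parsable_unary(rule, rest):
            kept = rest
    return kept

# Hypothetical unary rules: A -> B is parsable via A -> C and C -> B,
# while A -> C is parsable via A -> B and B -> C, so whichever of the
# two is addressed first is eliminated and the other survives.
rules = [('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'B')]
```

Compacting `rules` in the order given eliminates `A -> B`; compacting the reversed list eliminates `A -> C` instead, so the resulting grammars differ.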
    <Paragraph position="7"/>
  </Section>
  <Section position="6" start_page="699" end_page="699" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We conducted a number of compaction experiments: first, the complete grammar was parsed as described in Section 4. The results exceeded our expectations: the set of 17,529 rules reduced to only 1,667 rules, a better than 90% reduction.</Paragraph>
    <Paragraph position="1"> To investigate in more detail how the compacted grammar grows, we conducted a third experiment involving a staged compaction of the grammar. Firstly, the corpus was split into 10% chunks (by number of files) and the rule sets extracted from each. The staged compaction proceeded as follows: the rule set of the first 10% chunk was compacted, and then the rules for the next 10% added, and so on. Results of this experiment are shown in Figure 2.</Paragraph>
    <Paragraph position="2"> At 50% of the corpus processed, the compacted grammar size actually exceeds the level it reaches at 100%; beyond that point the overall grammar size goes down as well as up. This reflects the fact that new rules are either redundant or make &amp;quot;old&amp;quot; rules redundant, so that the compacted grammar size seems to approach a limit.</Paragraph>
  </Section>
  <Section position="7" start_page="699" end_page="699" type="metho">
    <SectionTitle>
6 Retaining Linguistically Valid
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="699" end_page="699" type="sub_section">
      <SectionTitle>
Rules
</SectionTitle>
      <Paragraph position="0"> Even though parsable rules are redundant in the sense that has been defined above, it does not follow that they should always be removed.</Paragraph>
      <Paragraph position="1"> In particular, there are times where the flatter structure allowed by some rule may be more linguistically correct, rather than simply a case of partial bracketting. Consider, for example, the (linguistically plausible) rules (10,11,12). Rules (11) and (12) can be used to parse (10), but rule (10) should not be eliminated, as there are cases where the flatter structure it allows is more linguistically correct.</Paragraph>
      <Paragraph position="3"> We believe that a solution to this problem can be found by exploiting the data provided by the corpus. Frequency-of-occurrence data for rules, collected from the corpus, can be used to assign probabilities to rules, and hence to the structures they allow, so as to produce a probabilistic context-free grammar. Where a parsable rule is correct rather than merely partially bracketted, we expect this fact to be reflected in rule and parse probabilities (which reflect the occurrence data of the corpus), and these can be used to decide when a rule that may be eliminated should be eliminated. In particular, a rule should be eliminated only when the more complex structure allowed by other rules is more probable than the simpler structure that the rule itself allows.</Paragraph>
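The probabilistic criterion can be sketched as follows. The relative-frequency estimation and the elimination test follow the paragraph above; the rule counts, the rule names, and the particular derivation are invented for illustration and are not taken from the paper (which does not present its algorithm). `math.prod` requires Python 3.8+.

```python
import math
from collections import defaultdict

def rule_probs(counts):
    """Relative-frequency estimate P(rhs | lhs) from rule counts,
    yielding a probabilistic context-free grammar."""
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

def should_eliminate(flat_rule, derivation, probs):
    """Eliminate the flat rule only if the more complex structure (the
    multiset of rules used to parse its right-hand side) is more
    probable than the flat rule itself."""
    deep_p = math.prod(probs[r] for r in derivation)
    return deep_p > probs[flat_rule]

# Hypothetical counts: the flat rule is rare, its parsing rules common,
# so the deeper structure wins and the flat rule is eliminated.
counts = {
    ('NP', ('DT', 'NN', 'CC', 'DT', 'NN')): 1,
    ('NP', ('NP', 'CC', 'NP')): 100,
    ('NP', ('DT', 'NN')): 6000,
}
probs = rule_probs(counts)
flat = ('NP', ('DT', 'NN', 'CC', 'DT', 'NN'))
derivation = [('NP', ('NP', 'CC', 'NP')),
              ('NP', ('DT', 'NN')), ('NP', ('DT', 'NN'))]
```

With counts skewed the other way (a frequent flat rule), the test fails and the rule is retained as linguistically valid.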
      <Paragraph position="4"> We developed a linguistic compaction algorithm employing the ideas just described.</Paragraph>
      <Paragraph position="5"> However, we cannot present it here due to space limitations. The preliminary results of our experiments are presented in Table 1.</Paragraph>
      <Paragraph position="6"> Simple thresholding (removing rules that only occur once) was also applied to achieve the maximum compaction ratio. For labelled as well as unlabelled evaluation of the resulting parse trees we used the evalb software by Satoshi Sekine. See (Krotov, 1998) for the complete presentation of our methodology and results.</Paragraph>
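The thresholding step is straightforward to sketch; the counts below are illustrative, not drawn from the treebank.

```python
def threshold(rule_counts, min_count=2):
    """Simple thresholding: keep only rules occurring at least `min_count`
    times (with the default, this removes rules that occur only once)."""
    return {rule: c for rule, c in rule_counts.items() if c >= min_count}

# Hypothetical counts: a common NP rule survives, a singleton flat rule
# is dropped.
counts = {('NP', ('DT', 'NN')): 5341,
          ('NP', ('DT', 'NN', 'CC', 'DT', 'NN')): 1}
```

`threshold(counts)` keeps only the frequent rule.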
      <Paragraph position="7"> As one can see, the fully compacted grammar yields poor recall and precision figures. This may be because collapsing the rules often produces too much substructure (hence the lower precision figures), and also because many longer rules in fact encode valid linguistic information. However, linguistic compaction combined with simple thresholding achieves a 58% reduction without any loss in performance, and a 69% reduction even yields higher recall.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="699" end_page="699" type="metho">
    <SectionTitle>
7 Conclusions
</SectionTitle>
    <Paragraph position="0"> We see the principal results of our work to be the following: * the result showing continued square-root growth in the rule set extracted from the</Paragraph>
  </Section>
</Paper>