<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1011">
  <Title>Learning and Application of Differential Grammars</Title>
  <Section position="8" start_page="0" end_page="0" type="concl">
    <SectionTitle>
10 Conclusions
</SectionTitle>
    <Paragraph position="0"> Differential Grammars allow high-order Ngram statistics to be focussed on the problem of deciding between the correct and one or more incorrect tokens, reducing Ngram contexts to environments based on high-frequency eigentokens: words, numbers, punctuation and affixes. Using the 150 Unix eigenwords gives us a 50% likelihood of a hit in any slot, while our 12 non-zero suffixes increase the coverage to 25%, ensuring that good syntactic relevance is obtained.</Paragraph>
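The reduction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the eigenword and suffix lists here are small stand-ins for the paper's 150 Unix eigenwords and 12 non-zero suffixes.

```python
# Stand-in lists: the paper uses the 150 most frequent Unix eigenwords
# and 12 non-zero suffixes; these shortened lists are assumptions.
EIGENWORDS = {"the", "of", "and", "to", "a", "in", "is"}
SUFFIXES = ("ing", "ed", "ly", "er", "est", "s")

def to_eigentoken(token: str) -> str:
    """Reduce a token to its eigentoken environment slot."""
    t = token.lower()
    if t in EIGENWORDS:
        return t              # high-frequency eigenword: kept verbatim
    for suf in SUFFIXES:
        if t.endswith(suf) and len(t) > len(suf):
            return "-" + suf  # open-class word: reduced to its suffix class
    return "*"                # no match: generic open-class wildcard

context = ["walking", "to", "the", "store"]
print([to_eigentoken(w) for w in context])  # ['-ing', 'to', 'the', '*']
```

Reducing every slot this way keeps the high-frequency, syntactically informative material while collapsing the open-class vocabulary that would otherwise make high-order Ngram statistics intractably sparse.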
    <Paragraph position="1"> In further smaller experiments, we demonstrated that the 'a/an' distinction could be handled by splitting our suffix and open eigenunits into vowel and consonant subclasses. We further demonstrated that similarly appropriate eigenunits could be automatically derived on a discounted frequency basis, using a crude heuristic to order the potential eigenunits, while restricting them to the form of lexical, space-bounded, words. Experiments involving training with an automatically derived eigenset have yet to be performed, and will focus on deciding the optimum size of eigenset and development of an improved heuristic.</Paragraph>
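A discounted-frequency derivation of the kind described might look like the sketch below. The discount value and scoring details are assumptions for illustration; the paper only specifies a crude frequency-based heuristic restricted to lexical, space-bounded words.

```python
from collections import Counter

def derive_eigenset(corpus_tokens, size=150, discount=0.5):
    """Hypothetical discounted-frequency heuristic (details assumed):
    rank candidate space-bounded lexical words by raw frequency minus
    an absolute discount, and keep the top `size` as the eigenset."""
    counts = Counter(t.lower() for t in corpus_tokens if t.isalpha())
    scored = {w: c - discount for w, c in counts.items() if c > discount}
    # Stable sort: ties keep first-occurrence order from the corpus.
    return [w for w, _ in sorted(scored.items(), key=lambda kv: -kv[1])][:size]

tokens = "the cat sat on the mat and the dog sat too".split()
print(derive_eigenset(tokens, size=3))  # ['the', 'sat', 'cat']
```

Varying `size` here is exactly the open experiment mentioned above: too small an eigenset loses coverage, too large a one admits genre-sensitive, semantically loaded words.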
    <Paragraph position="2"> The eigenset has two functions: to allow us to reduce the size of the tree for a given performance level, and to reduce the role of genre-related and semantically related fluctuations in word frequencies by concentrating on features of relatively high syntactic significance. Beyond a certain point, increasing the size of the eigenset is expected to decrease performance due to increased noise; similarly, it may eventually tend to increase the size of the stored differential grammars without significant gain in precision.</Paragraph>
    <Paragraph position="3"> The use of a significance factor in the training stage allowed the size of the trees generated by the differential grammar generator to be limited to what was necessary to achieve that level of precision on the training corpus, whilst the likelihood values stored in the tree allowed the user to be informed of the likelihood of an error (using colour or upon query), and to control the threshold above which errors would be reported.</Paragraph>
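The user-facing side of the stored likelihoods can be sketched in a few lines. The flag format and threshold default here are hypothetical, not the system's interface:

```python
# Hypothetical (token, error likelihood) flags as stored in the tree.
flags = [("there", 0.92), ("its", 0.40), ("too", 0.75)]

def report(flags, threshold=0.5):
    """Report only the errors whose stored likelihood clears the
    user-controlled threshold."""
    return [(tok, p) for tok, p in flags if p >= threshold]

print(report(flags, threshold=0.5))  # [('there', 0.92), ('too', 0.75)]
```

Raising the threshold trades recall for precision: the same trees serve both a cautious proofreader and one who wants only near-certain errors flagged.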
    <Paragraph position="4"> Maximum diameter is another parameter of the training stage; experiments on its optimal size and on the role of diameter in relation to syntactic and semantic words were undertaken early on, leading to 10 being set as the size beyond which environments were unlikely to reach significance. If generation of the grammar stopped due to lack of significance, the problem was often lack of data. If the search was terminated at maximum diameter, this indicated that the words were functionally similar, and most likely the same part of speech.</Paragraph>
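The two stopping conditions just described can be expressed as a small control loop. This is a sketch under assumptions, not the generator's algorithm: `counts_at` is a hypothetical callback standing in for the corpus statistics available at each diameter.

```python
MAX_DIAMETER = 10  # the size beyond which significance was unlikely

def grow_environment(counts_at, threshold):
    """Widen the Ngram environment around a confusable slot until the
    evidence reaches a significance threshold or the maximum diameter
    is hit. counts_at(d) is a hypothetical evidence-count callback."""
    for d in range(1, MAX_DIAMETER + 1):
        if counts_at(d) >= threshold:
            return d, "significant"      # enough evidence to decide
    # Reaching maximum diameter means the confused words behave alike
    # in all nearby contexts: likely the same part of speech.
    return MAX_DIAMETER, "max-diameter"

print(grow_environment(lambda d: d * 3, 9))  # (3, 'significant')
```

The two return labels mirror the diagnostic distinction above: stopping early for lack of significance usually signals sparse data, while running out to the maximum diameter signals functional similarity.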
    <Paragraph position="5"> The differential grammar approach has proven to be a successful way of applying statistical, Ngram-like techniques to practical grammar-checking in a modest computing environment, with useful grammar trees requiring of the order of 100 to 1000 bytes of storage per confused word pair in most cases. This report has concentrated on presenting empirical results for a single system, rather than on optimization of the system, and there remains considerable scope for investigating the role of the system parameters and optimizing the eigenset, for which only the primary considerations have been outlined. The primary deficiency of the system is its inability to cope with arbitrarily long parentheses or subclauses which separate syntactically bound elements, but it is also rather sensitive to the genre and representativeness of the training corpus.</Paragraph>
  </Section>
</Paper>