<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0713">
  <Title>Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering</Title>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
7 Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation of unsupervised algorithms is difficult.</Paragraph>
    <Paragraph position="1"> One evaluation scheme that has been used is to compare the constituent structures produced by the grammar induction algorithm against a treebank, and use PARSEVAL scoring metrics, as advocated by (van Zaanen and Adriaans, 2001); i.e. use exactly the same evaluation as is used for supervised learning schemes. This proposal fails to take account of the fact that the annotation scheme used in any corpus, does not reflect some theory-independent reality, but is the product of various more or less arbitrary decisions by the annotators (Carroll et al., 1998). Given a particular annotation scheme, the structure in the corpus is not arbitrary, but the choice of annotation scheme inevitably is. Thus expecting an unsupervised algorithm to converge on one particular annotation scheme out of many possible ones seems overly onerous.</Paragraph>
    <Paragraph position="2"> It is at this point that one must question what the point of syntactic structure is: it is not an end in itself but a precursor to semantics. We need to have syntactic structure so we can abstract over it when we learn the semantic relationships between words. Seen in this context, the suggestion of evaluation based on dependency relationships amongst words(Carroll et al., 1998) seems eminently sensible.</Paragraph>
    <Paragraph position="3"> With unsupervised algorithms, there are two aspects to the evaluation; first how good the annotation scheme is, and secondly how good the parsing algorithm is - i.e. how accurately the algorithm assigns the structures. Since we have a very basic non-lexicalised parser, I shall focus on evaluating the sort of structures that are produced, rather than trying to evaluate how well the parser works. To facilitate comparison with other techniques, I shall also present an evaluation on the ATIS corpus.</Paragraph>
    <Paragraph position="4"> Pereira and Schabes (1992) establish that evaluation according to the bracketing accuracy and evaluation according to perplexity or cross-entropy are very different. In fact, the model trained on the bracketed corpus, although scoring much better on bracketing accuracy, had a higher (worse) perplexity than the one trained on the raw data. This means that optimising the likelihood of the model may not lead you to a linguistically plausible grammar.</Paragraph>
    <Paragraph position="5"> In Table 3 I show the non-terminals produced during the first 20 iterations of the algorithm.</Paragraph>
    <Paragraph position="6"> Note that there are less than 20 of them, since as mentioned above sometimes we will add more rules to an existing non-terminal. I have taken the  Note that three of them are recursive.</Paragraph>
    <Paragraph position="7"> liberty of attaching labels such as NP to the non-terminals where this is well justified. Where it is not, I leave the symbol produced by the program which starts with NT-. Table 5 shows the most frequent rules expanding the NP non-terminal.</Paragraph>
    <Paragraph position="8"> Note that there is a good match between these rules and the traditional phrase structure rules.</Paragraph>
    <Paragraph position="9"> To facilitate comparison with other unsupervised approaches, I performed an evaluation against the ATIS corpus. I tagged the ATIS corpus with the CLAWS tags used here, using the CLAWS demo tagger available on the web, removed empty constituents, and adjusted a few tokenisation differences (at least is one token in the BNC.) I then corrected a few systematic tagging errors. This might be slightly controversial. For example, &amp;quot;Washington D C&amp;quot; which is three tokens was tagged as NP0 ZZ0 ZZ0 where ZZ0 is a tag for alphabetic symbols. I changed the ZZ0 tags to NP0. In the BNC, that I trained the model on, the DC is a single token tagged as NP0, and in the ATIS corpus it is marked up as a sequence of three NNP. I did not alter the mark up of flight codes and so on that occur frequently in this corpus and very infrequently in the BNC.</Paragraph>
    <Paragraph position="10"> It is worth pointing out that the ATIS corpus is a very simple corpus, of radically different structure and markup to the BNC. It consists primarily of short questions and imperatives, and many sequences of letters and numbers such as T W A, A P 5 7 and so on.</Paragraph>
    <Paragraph position="11"> For instance, a simple sentence like &amp;quot;Show me the meal&amp;quot; has the gold standard parse:  According to this evaluation scheme its recall is only 33%, because of the presence of the non-branching rules, though intuitively it has correctly identified the bracketing. However, the crossing brackets measures overvalues these algorithms, since they produces only partial parses - for some sentences my algorithm produces a completely flat parse tree which of course has no crossing brackets.</Paragraph>
    <Paragraph position="12"> I then performed a partial parse of this data using the SCFG trained on the BNC, and evaluated the results against the gold-standard ATIS parse using the PARSEVAL metrics calculated by the EVALB program. Table 6 presents the results of the evaluation on the ATIS corpus, with the results on this algorithm (CDC) compared against two other algorithms, EMILE (Adriaans et al., 2000) and ABL (van Zaanen, 2000). The comparison presented here allows only tentative conclusions for these reasons: first, there are minor differences in the test sets used; secondly, the CDC algorithm is not completely unsupervised at the moment as it runs on tagged text, whereas ABL and EMILE run on raw text, though since the ATIS corpus has very little lexical ambiguity the difference is probably quite minor; thirdly, it is worth reiterating that the CDC algorithm was trained on a radically different and much more complex data set. However, we can conclude that the CDC algorithm compares favourably to other unsupervised algorithms.</Paragraph>
  </Section>
class="xml-element"></Paper>