<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1085">
  <Title>Estimation of Stochastic Attribute-Value Grammars using an Informative Sample</Title>
  <Section position="7" start_page="589" end_page="591" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> Here we present two sets of experiments. The first set demonstrate the existence of an informative sample. It also shows some of the characteristics of three smnpling strategies. The second set of experiments is larger in scale, and show RFMs (both lexicalised and unlexicalised) estimated using sentences up to 30 tokens long. Also, the effects of a Gaussian prior are demonstrated as a way of (partially) dealing with overfitting.</Paragraph>
    <Section position="1" start_page="589" end_page="590" type="sub_section">
      <SectionTitle>
6.1 Testing the Various Sampling
Strategies
</SectionTitle>
      <Paragraph position="0"> In order to see how various sizes of sample related to estimation accuracy and whether we could achieve similar levels of performm~ce without recovering all possible parses, we ran the following experiments.</Paragraph>
      <Paragraph position="1"> We used a model consisting of features that were defined using all three templates. We also threw away all features that occurred less than two times in the training set. We randomly split; the Wall Street Journal into disjoint training, held-out and testing sets. All sentences in the training and held-out sets were at most 14 tokens long. Sentences ill the testtug set, were at most 30 tokens long. There were 6626 sentences in the training set, 98 sentences in the held-out set and 441 sentences in tile testing set.</Paragraph>
      <Paragraph position="2"> Sentences in the held-out set had on average 12.6 parses, whilst sentences in the testing-set had on average 60.6 parses per sentence.</Paragraph>
      <Paragraph position="3"> The held-out set was used to decide which model performed best. Actual performmme of the models should be judged with rest)ect to the testing set.</Paragraph>
      <Paragraph position="4"> Evaluation was in terIns of exact match: tbr each sentence in the test set, we awarded ourselves a t)oint if the RFM ranked highest the same parse that was ranked highest using the reference probabilities.</Paragraph>
      <Paragraph position="5"> When evahmting with respect to the held-out set, we recovered all parses for sentences in the held-out set. When evaluating with respect to the testing-set, we recovered at most 100 parses per sentence.</Paragraph>
      <Paragraph position="6"> For each run, we ran IIS for the same number of iterations (20). In each case, we evaluated the RFM after each other iteration and recorded the best classification pertbrmance. This step was designed to avoid overfitting distorting our results.</Paragraph>
      <Paragraph position="7"> Figure 2 shows the results we obtained with possible ways of picking 'typical' samples. The first column shows the maxinmm number of parses per sentences that we retrieved in each sample.</Paragraph>
      <Paragraph position="8"> The second column shows the size of the sample (in parses).</Paragraph>
      <Paragraph position="9"> The other cohmms give classification accuracy results (a percentage) with respect to the testing set. In parentheses, we give performance with respect; to the held-out set.</Paragraph>
      <Paragraph position="10"> The column marked Rand shows the performance  of runs that used a sample that contained parses which were randomly and uniformly selected out of the set, of all possible parses. The classification accuracy results for this sampler are averaged over 10 runs.</Paragraph>
      <Paragraph position="11"> The column marked SCFG shows the results obtained when using a salnple that contained 1)arses that were retrieved using the probabilistic unI)acking strategy. This did not involve retrieving all possible parses for each sentence in the training set,. Since there is no random component, the results arc fl'om a single run. Here, parses were ranked using a stochastic context free backbone approximation of TSG. Parameters were estimated using simple counting.</Paragraph>
      <Paragraph position="12"> FinMly, the eohunn marked Ref shows the results ol)tained when USillg a sample that contained the overall n-best parses per sentence, as defined in terms of the reference distril)ution.</Paragraph>
      <Paragraph position="13"> As a baseline, a nlodel containing randomly assigned weights produced a classification accuracy of 45% on the held-out sentences. These results were averaged over 10 runs.</Paragraph>
      <Paragraph position="14"> As can be seen, increasing the sainple size produces better results (for ca&amp; smnl)ling strategy). Around a smnple size of 40k parses, overfitting starts to manifest, and perIbrmance bottoms-out. One of these is therefore our inforinative sample. Note that the best smnple (40k parses) is less than 20% of the total possible training set.</Paragraph>
      <Paragraph position="15"> The ditference between the various samplers is marginal, with a slight preference for Rand. However the fact that SUFG sampling seems to do ahnost as well as Rand sampling, and fllrthermore does not require unpacking all parses, makes it the sampling strategy of choice.</Paragraph>
      <Paragraph position="16"> SCFG sampling is biased in the sense that the sample produced using it will tend to concentrate around those parses that are all close to the best, parses. Rand smnpling is unbiased, and, apart h'om the practical problems of having to recover all parses, nfight in some circumstances be better than SCFG sampling. At the time of writing this paper, it was unclear whether we could combine SCFG with Rand sampling -sample parses from the flfll distribu- null lion without unpacking all parses. We suspect that for i)robabilistic unt)acking to be efficient, it nmst \]:ely upon some non-uniform distribution. Unpacking randomly and uniformly would probably result in a large loss in computational eliiciency.</Paragraph>
    </Section>
    <Section position="2" start_page="590" end_page="591" type="sub_section">
      <SectionTitle>
6.2 Larger Scale Evaluation
</SectionTitle>
      <Paragraph position="0"> Here we show results using a larger salnl)le and testing set. We also show the effects of lexicalisation, overtitting, and overfitting avoidance using a Gaussian prior. Strictly speaking this section could have been omitted fl'om the paper. However, if one views estimation using an informative sami)le as overfitling avoi(lance, then estimation using a Gaussian l)rior Call be seen as another, complementary take on the problem.</Paragraph>
      <Paragraph position="1"> The experimental setup was as follows. We ralldomly split the Wall St, reel: Journal corpus into a training set and a testing set. Both sets contained sentence.s t;hat were at most 30 tokens hmg. When creating the set of parses used to estimate Ii.FMs, we used the SCFG approach, and retained the top 25 parses per sentence. Within the training set (arising Dora 16, 200 sentences), there were 405,020 parses.</Paragraph>
      <Paragraph position="2"> The testing set consisted of 466 sentences, with an average of 60.6 parses per sentence.</Paragraph>
      <Paragraph position="3"> When evahmtillg, we retrieved at lllOSt 100 lmrscs per sentence in the testing set and scored them using our reference distribution. As lmfore, we awarded ourselves a i)oinl; if the most probable testing parse (in terms of the I/.MF) coincided with the most t)rol)able parse (in terms of the reference distribution). In all eases, we ran IIS tbr 100 iterations.</Paragraph>
      <Paragraph position="4"> For the tirst experiment, we used just the first telnp\]at('. (features that rc'la.t(;d to DC(I insl;antiations) to create model l; the second experiment uso.d the first and second teml)lat(~s (additional t'eatm'o.s relating to PP attachment) to create model 2. The linal experiment used all three templat('~s (additional fea,tllres that were head-lexicalised) to create model 3.</Paragraph>
      <Paragraph position="5"> The three mo(lels contained 39,230, 65,568 and 278, 127 featm:es respectively, As a baseline, a model containing randomly assigned weights achieved a 22% classification accuracy. These results were averaged over 10 runs. Figure 3 shows the classification accuracy using models 1, 2 and 3.</Paragraph>
      <Paragraph position="6"> As can 1)e seen, the larger scale exl)erimental results were better than those achieved using the smaller samples (mentioned in section 6.1). The rea-Sell for this was because we used longer sentc,11ces. The. informative sainple derivable Kern such a training set was likely to be larger (more representative of  Estinmted using a Gmlssian Prior and IIS the population) than the informative sample derival)led from a training set using shorter, less syntat'tically (Xmll)lex senten(:es. With the unle.xicalised model, we see (:lear signs of overfitting. Model 2 overfits even more so. For reasons that are unclear, we see that the larger model 3 does not ai)pem: to exhibit overtitting.</Paragraph>
      <Paragraph position="7"> We next used the Gaussian Prior method of Chen and Rosenfeld to reduce overfitting (Chen and Rosenfeld, 1999b). This involved integrating a Gaussian prior (with a zero mean) into Ills and searching for the model that maximised the, product of the likelihood and prior prolmbilities. For the experiments reported here, we used a single wlriante over the entire model (better results might be achievable if multiple variances were used, i)erhaps with one variance per telnl)late type). The aetllal value of the variance was t'cmnd by trial-and-error.</Paragraph>
      <Paragraph position="8"> Itowever, optimisation using a held-out set is easy to achieve,.</Paragraph>
      <Paragraph position="9">  We repeated the large-scale experiment, but this time using a Gaussian prior. Figure 4 shows the classification accuracy of the models when using a Gmlssian Prior.</Paragraph>
      <Paragraph position="10"> When we used a Gaussian prior, we fotmd that all models showed signs of imt)rovenmnt (allbeit with varying degrees): performance either increased, or else did not decrease with respect to the munber of iterations, Still, model 2 continued to underperform. Model 3 seemed most resistent to the prior. It theretbre appears that a Gaussian prior is most useful for unlexicalised models, and that for models built from complex, overlapping features, other forms of smoothing must be used instead.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>