<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1085"> <Title>Estimation of Stochastic Attribute-Value Grammars using an Informative Sample</Title> <Section position="2" start_page="0" end_page="586" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Abney showed that attribute-value grammars cannot be modelled adequately using statistical techniques which assume that statistical dependencies are accidental (Ablmy, 1997). Instead of using a model class that assumed independence, Abney suggested using Random Fields Models (RFMs) tbr attribute-value grmnmars. RFMs deal with the graphical structure of a parse. Because they do not make independence assumptions about the stochastic generation process that might have produced some parse, they are able to model correctly dependencies that exist within parses.</Paragraph> <Paragraph position="1"> When estimating standardly-formulated RFMs, it is necessary to sum over all parses licensed by the grammar. For many broad coverage natural language grammars, this might involve summing over an exponential number of parses. This would make the task eomtmtationally intractable. Almey, following the lead of Lafferty et al, suggested a Monte * Current address: osborne@eogsei.ed.ae.uk, University of Edinburgh, Division of Informaties, 2 Bueeleuch Place, EII8 9LW, Scotland.</Paragraph> <Paragraph position="2"> Carlo simulation as a way of reducing the computational burden associated with RFM estimation (Lafferty et al., 1997). However, Johnson ct al considered the form of sampling used in this sinmlation (Metropolis-Hastings) intractable (Johnson et M., 1999). Instead, they proposed an Mternative strategy that redefined the estimation task. It was argued that this redefinition made estimation eomtmtation-Mly simple enough that a Monte Carlo simulation was unnecessary. They presented results obtained using a small unlexicalised model trained on a modest corlms.</Paragraph> <Paragraph position="3"> Unfortunately, Johnson et al assumed it was possible to retrieve all parses licensed by a grmnmar when parsing a given training set. For us, this was not the case. In our experiments with a manually written broad coverage Definite Clause Grammar (DCG) (Briscoe and Carroll, 1996), we were only able to recover M1 parses for Wall Street .Journal sentences that were at most 13 tokens long within acceptable time and space bounds on comtmtation. When we used an incremental Minilnum Description Length (MDL) based learner to extend the coverage of our mmmally written gralnular (froul roughly 6()~ to around 90% of the parsed Wall Street .Jouriml), the situation became worse. Sentence ambiguity considerably increased. We were then only able to recover all parses for Wall Street Journal sentences that were at most 6 tokens long (Osborne, 1999).</Paragraph> <Paragraph position="4"> We can however, and usually in polynomial time, recover up to 30 parses for sentences up to 30 tokens long when we use a probabilistic unpacking mechanism (Carroll and Briscoe, 1992). (Longer sentences than 30 tokens can be parsed, but the nmnber of parses we can recover for them drops off rapidly). 1 However, 30 is far less tlmn the maximum number l&quot;vVe made an attempt to determine the maximum number of parses our grammar might assign to sentences. 
<Paragraph position="3"> Unfortunately, Johnson et al. assumed it was possible to retrieve all parses licensed by a grammar when parsing a given training set. For us, this was not the case. In our experiments with a manually written broad coverage Definite Clause Grammar (DCG) (Briscoe and Carroll, 1996), we were only able to recover all parses for Wall Street Journal sentences that were at most 13 tokens long within acceptable time and space bounds on computation. When we used an incremental Minimum Description Length (MDL) based learner to extend the coverage of our manually written grammar (from roughly 60% to around 90% of the parsed Wall Street Journal), the situation became worse. Sentence ambiguity considerably increased. We were then only able to recover all parses for Wall Street Journal sentences that were at most 6 tokens long (Osborne, 1999).</Paragraph>
<Paragraph position="4"> We can however, and usually in polynomial time, recover up to 30 parses for sentences up to 30 tokens long when we use a probabilistic unpacking mechanism (Carroll and Briscoe, 1992). (Longer sentences than 30 tokens can be parsed, but the number of parses we can recover for them drops off rapidly.) However, 30 is far less than the maximum number of parses per sentence our grammar might assign to Wall Street Journal sentences. Any training set we have access to will therefore be necessarily limited in size.
(Footnote 1: We made an attempt to determine the maximum number of parses our grammar might assign to sentences. On a 450MHz Ultra Sparc 80 with 2 Gb of real memory, with a limit of at most 1000 parses per sentence, and allowing no more than 100 CPU seconds per sentence, we found that sentence ambiguity increased exponentially with respect to sentence length. Sentences with 30 tokens had an estimated average of 866 parses (standard deviation 290.4). Without the limit of 1000 parses per sentence, it seems likely that this average would increase.)</Paragraph>
<Paragraph position="6"> We therefore need an estimation strategy that takes seriously the issue of extracting the best performance from a limited size training set. A limited size training set means one created by retrieving at most n parses per sentence. Although we cannot recover all possible parses, we do have a choice as to which parses estimation should be based upon.</Paragraph>
<Paragraph position="7"> Our approach to the problem of making RFM estimation feasible for our highly ambiguous DCG is to seek out an informative sample and train upon that. We do not redefine the estimation task in a non-standard way, nor do we use a Monte Carlo simulation. We call a sample informative if it both leads to the selection of a model that does not underfit or overfit, and also is typical of future samples. Despite one's intuitions, an informative sample might be a proper subset of the full training set. This means that estimation using the informative sample might yield better results than estimation using all of the training set.</Paragraph>
<Paragraph position="8"> The rest of this paper is as follows. Firstly, we introduce RFMs. Then we show how they may be estimated and how an informative sample might be identified. Next, we give details of the attribute-value grammar we use, and show how we go about modelling it. We then present two sets of experiments. The first set is small scale, and is designed to show the existence of an informative sample. The second set of experiments is larger in scale, and builds upon the computational savings we are able to achieve using a probabilistic unpacking strategy. They show how large models (two orders of magnitude larger than those reported by Johnson et al.) can be estimated using the parsed Wall Street Journal corpus. Overfitting is shown to take place. They also show how this overfitting can be (partially) reduced by using a Gaussian prior. Finally, we end with some comments on our work.</Paragraph> </Section></Paper>