<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1085">
  <Title>Estimation of Stochastic Attribute-Value Grammars using an Informative Sample</Title>
  <Section position="3" start_page="586" end_page="587" type="metho">
    <SectionTitle>
2 Random Field Models
</SectionTitle>
    <Paragraph position="0"> Here we show how attribute-value grammars may be modelled using RFMs. Although our commentary is in terms of RFMs and grammars, it should be obvious that RFM technology can be applied to other estimation scenarios.</Paragraph>
    <Paragraph position="1"> Let G be an attribute-value grammar, D the set of sentences within the string-set defined by L(G), and Ω the union of the sets of parses assigned to each sentence in D by the grammar G. A Random Field Model, M, consists of two components: a set of features, F, and a set of weights, Λ.</Paragraph>
    <Paragraph position="2"> Features are the basic building blocks of RFMs.</Paragraph>
    <Paragraph position="3"> They enable the system designer to specify the key aspects of what it takes to differentiate one parse from another parse. Each feature is a function from a parse to an integer. Here, the integer value associated with a feature is interpreted as the number of times a feature 'matches' (is 'active') with a parse. Note features should not be confused with features as found in feature-value bundles (these will be called attributes instead). Features are usually manually selected by the system designer.</Paragraph>
    <Paragraph position="4"> The other component of a RFM, Λ, is a set of weights. Informally, weights tell us how features are to be used when modelling parses. For example, an active feature with a large weight might indicate that some parse had a high probability. Each weight λi is associated with a feature fi. Weights are real-valued numbers and are automatically determined by an estimation process (for example using Improved Iterative Scaling (Lafferty et al., 1997)). One of the nice properties of RFMs is that the likelihood function of a RFM is strictly concave. This means that there are no local minima, and so we can be sure that scaling will result in estimation of a RFM that is globally optimal.</Paragraph>
    <Paragraph position="5"> The (unnormalised) total weight of a parse x, ψ(x), is a function of the k features that are 'active' on a parse:</Paragraph>
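    The displayed equation itself is not recoverable here; a standard random-field form consistent with the feature functions fi and weights λi defined above, offered only as a reconstruction, is:

        \psi(x) = \exp\left( \sum_{i=1}^{k} \lambda_i f_i(x) \right) \qquad (1)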
    <Paragraph position="7"> The probability of a parse, P(x | M), is simply the result of normalising the total weight associated with that parse:</Paragraph>
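    Again the display is missing; under the same assumption, the normalisation described in the text amounts to:

        P(x \mid M) = \frac{\psi(x)}{Z}, \qquad Z = \sum_{y \in \Omega} \psi(y) \qquad (2)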
    <Paragraph position="9"> The interpretation of this probability depends upon the application of the RFM. Here, we use parse probabilities to reflect preferences for parses.</Paragraph>
    <Paragraph position="10"> When using RFMs for parse selection, we simply select the parse that maximises ψ(x). In these circumstances, there is no need to normalise (compute Z). Also, when computing ψ(x) for competing parses, there is no built-in bias towards shorter (or longer) derivations, and so no need to normalise with respect to derivation length.2  2 The reason there is no need to normalise with respect to derivation length is that features can have positive or negative weights. The weight of a parse will therefore not always monotonically increase with respect to the number of active features.</Paragraph>
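    To make the parse-selection step concrete, here is a minimal Python sketch. It assumes a parse is represented by its sparse vector of active feature counts; the feature indices and weight values below are hypothetical, and, as noted in the text, no normalising constant Z is computed.

    ```python
    def total_weight(active_counts, weights):
        # Sum over active features of lambda_i * f_i(x); since exp() is monotonic,
        # this exponent can stand in for psi(x) when ranking competing parses.
        return sum(weights[i] * count for i, count in active_counts.items())

    def select_parse(parses, weights):
        # Pick the parse maximising psi(x); Z is constant across the competing
        # parses of one sentence, so it never needs to be computed for selection.
        return max(parses, key=lambda parse: total_weight(parse["features"], weights))

    # Hypothetical example: two competing parses with sparse feature counts.
    weights = {0: 1.2, 1: -0.7, 2: 0.3}
    parses = [
        {"id": "parse-a", "features": {0: 1, 1: 2}},  # f_0 fires once, f_1 twice
        {"id": "parse-b", "features": {0: 1, 2: 1}},
    ]
    print(select_parse(parses, weights)["id"])  # prints: parse-b
    ```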
  </Section>
  <Section position="4" start_page="587" end_page="587" type="metho">
    <SectionTitle>
3 RFM Estimation and Selection of the Informative Sample
</SectionTitle>
    <Paragraph position="0"> the Informative Sample We now sketch how RFMs may be estimated and then outline how we seek out an informa.tive smnple. We use hnproved Iterative Scaling (IIS) to estimate RFMs. In outline, the IIS algorithm is as follows: null  1. Start with a reference distribution H,, a set of features F and a set of weights A. Let M be the RFM defined using F and A.</Paragraph>
    <Paragraph position="1"> 2. Initialise all weights to zero. This makes the initial model uniform.</Paragraph>
    <Paragraph position="2"> 3. Compute the expectation of each feature w.r.t R.</Paragraph>
    <Paragraph position="3"> 4. For each feature fi: (a) Find a new weight that equates the expectation of fi w.r.t R and the expectation of fi w.r.t M.</Paragraph>
    <Paragraph position="4"> (b) Replace the old value of λi with this new weight. 5. If the model has converged to R, output M. 6. Otherwise, go to step 4.  The key step here is 4a, computing the expectations of features w.r.t the RFM. This involves calculating the probability of a parse, which, as we saw from equation 2, requires a summation over all parses in Ω.</Paragraph>
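    The following is a minimal sketch of the loop above, under stated assumptions: ref_expectation holds the feature expectations under R from step 3, and solve_for_weight is a hypothetical callback standing in for step 4a, where the expensive model expectations (the summation over Ω just mentioned) are computed.

    ```python
    def improved_iterative_scaling(num_features, ref_expectation, solve_for_weight,
                                   max_iters=100, tol=1e-6):
        # Steps 1-2: start from the uniform model (all weights zero).
        weights = [0.0] * num_features
        for _ in range(max_iters):
            # Step 4a: for each feature, find the weight that equates E_R[f_i]
            # with E_M[f_i] under the current model (hidden inside the
            # hypothetical solve_for_weight callback).
            new_weights = [solve_for_weight(weights, i, ref_expectation[i])
                           for i in range(num_features)]
            movement = max(abs(new - old) for new, old in zip(new_weights, weights))
            weights = new_weights            # step 4b: replace the old weights
            if movement < tol:               # steps 5-6: stop once the model has converged
                break
        return weights
    ```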
    <Paragraph position="5"> We seek out an informative sample ΩI (ΩI ⊆ Ω) as follows (a schematic sketch appears after the observations below):  1. Pick out from Ω a sample of size n.</Paragraph>
    <Paragraph position="6"> 2. Estimate a model using that sample and evaluate it.</Paragraph>
    <Paragraph position="7"> 3. If the model just estimated shows signs of overfitting (with respect to an unseen held-out data set), halt and output the model.</Paragraph>
    <Paragraph position="8"> 4. Otherwise, increase n and go back to step 1.  Our approach is motivated by the following (partially related) observations: * Because we use a non-parametric model class and select an instance of it in terms of some sample (section 5 gives details), a stochastic complexity argument tells us that an overly simple model (resulting from a small sample) is likely to underfit. Likewise, an overly complex model (resulting from a large sample) is likely to overfit. An informative sample will therefore relate to a model that does not under- or overfit. * On average, an informative sample will be 'typical' of future samples. For many real-life situations, this set is likely to be small relative to the size of the full training set.</Paragraph>
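    A minimal sketch of this search, assuming hypothetical helpers pick_sample (step 1), estimate_model (step 2, for example IIS as sketched earlier) and heldout_score (the held-out evaluation of step 3); the starting size and growth schedule are assumptions for illustration.

    ```python
    def find_informative_sample(omega, pick_sample, estimate_model, heldout_score,
                                start_size=100, growth_factor=2):
        n = start_size
        best_model, best_score = None, float("-inf")
        while n <= len(omega):
            sample = pick_sample(omega, n)   # step 1: draw a sample of size n from Omega
            model = estimate_model(sample)   # step 2: estimate a model from that sample
            score = heldout_score(model)     # step 3: evaluate on unseen held-out data
            if score < best_score:           # a falling held-out score signals overfitting
                return best_model            # halt and output the previous model
            best_model, best_score = model, score
            n *= growth_factor               # step 4: increase n and go back to step 1
        return best_model
    ```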
    <Paragraph position="9"> We incorporate the first observation through our search mechanism. Because we start with small samples and gradually increase their size, we remain within the domain of efficiently recoverable samples. The second observation is (largely) incorporated in the way we pick samples. The experimental section of this paper goes into the relevant details. Note our approach is heuristic: we cannot afford to evaluate all 2^|Ω| possible training sets. The actual size of the informative sample ΩI will depend both upon the model class used and the maximum sentence length we can deal with. We would expect richer, lexicalised models to exhibit overfitting with smaller samples than would be the case with unlexicalised models. We would expect the size of an informative sample to increase as the maximum sentence length increased.</Paragraph>
    <Paragraph position="10"> There are similarities between our approach and estimation using MDL (Rissanen, 1989). However, our implementation does not explicitly attempt to minimise code lengths. Also, there are similarities with importance sampling approaches to RFM estimation (such as (Chen and Rosenfeld, 1999a)). However, such attempts do not minimise under- or overfitting.</Paragraph>
  </Section>
  <Section position="5" start_page="587" end_page="587" type="metho">
    <SectionTitle>
4 The Grammar
</SectionTitle>
    <Paragraph position="0"> The grammar we model with Random Fields (called the Tag Sequence Grammar (Briscoe and Carroll, 1996), or TSG for short) was developed with regard to coverage, and when compiled consists of 455 Definite Clause Grammar (DCG) rules. It does not parse sequences of words directly, but instead assigns derivations to sequences of part-of-speech tags (using the CLAWS2 tagset). The grammar is relatively shallow (for example, it does not fully analyse unbounded dependencies), but it does make an attempt to deal with common constructions, such as dates or names, commonly found in corpora but of little theoretical interest. Furthermore, it integrates into the syntax a text grammar, grouping utterances into units that reduce the overall ambiguity.</Paragraph>
  </Section>
  <Section position="6" start_page="587" end_page="589" type="metho">
    <SectionTitle>
5 Modelling the Grammar
</SectionTitle>
    <Paragraph position="0"> Modelling the TSG with respect to the parsed Wall Street Journal consists of two steps: creation of a feature set and definition of the reference distribution. Our feature set is created by parsing sentences in the training set (ΩT), and using each parse to instantiate templates. Each template defines a family of features. At present, the templates we use are somewhat ad hoc. However, they are motivated by the observations that linguistically-stipulated units (DCG rules) are informative, and that many DCG applications in preferred parses can be predicted using lexical information.</Paragraph>
    <Paragraph position="1">  The first template creates features that count the number of times a DCG instantiation is present within a parse.3 For example, suppose we parsed the Wall Street Journal AP:</Paragraph>
    <Paragraph position="3"> A parse tree generated by TSG might be as shown in figure 1. Here, to save on space, we have labelled each interior node in the parse tree with TSG rule names, and not attribute-value bundles. Furthermore, we have annotated each node with the head word of the phrase in question. Within our grammar, heads are (usually) explicitly marked. This means we do not have to make any guesses when identifying the head of a local tree. With head information, we are able to lexicalise models. We have suppressed tagging information.</Paragraph>
    <Paragraph position="4"> For example, a feature defined using this template might count the number of times we saw: AP/a1 | A1/a1/p1 in a parse. Such features record some of the context of the rule application, in that rule applications that differ in terms of how attributes are bound will be modelled by different features.</Paragraph>
    <Paragraph position="5"> Our second template creates features that are partially lexicalised. For each local tree (of depth one) that has a PP daughter, we create a feature that counts the number of times that local tree, decorated with the head word of the PP, was seen in a parse.</Paragraph>
    <Paragraph position="6"> An example of such a lexicalised feature would be:</Paragraph>
    <Paragraph position="8"> 3 Note, all our features suppress any terminals that appear in a local tree. Lexical information is included when we decide to lexicalise features.</Paragraph>
    <Paragraph position="9"> These features are designed to model PP attachments that can be resolved using the head of the PP.</Paragraph>
    <Paragraph position="10"> The third and final template creates features that are again partially lexicalised. This time, we create local trees of depth one that are decorated with the head word. For example, here is one such feature: AP/a1:unimpeded | A1/app1  Note the second and third templates result in features that overlap with features resulting from applications of the first template.</Paragraph>
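    As an illustration of how the three templates might be instantiated, here is a hedged Python sketch; the tuple-based tree encoding, the startswith("P") test for a PP daughter, and the helper names are assumptions for illustration, not the authors' implementation.

    ```python
    from collections import Counter

    def extract_features(node, counts=None):
        # A node is assumed to be (rule_name, head_word, children); terminals are
        # suppressed, mirroring footnote 3.
        if counts is None:
            counts = Counter()
        rule, head, children = node
        if children:
            daughters = tuple(child[0] for child in children)
            counts[("rule", rule, daughters)] += 1                      # template 1: DCG instantiation
            for child in children:
                if child[0].startswith("P"):                            # crude test for a PP daughter
                    counts[("pp-lex", rule, daughters, child[1])] += 1  # template 2: PP head word
            counts[("head-lex", rule, daughters, head)] += 1            # template 3: own head word
            for child in children:
                extract_features(child, counts)
        return counts

    # Hypothetical fragment in the spirit of the AP/a1:unimpeded feature above.
    tree = ("AP/a1", "unimpeded", (("A1/app1", "unimpeded", ()),))
    print(extract_features(tree))
    ```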
    <Paragraph position="11"> We create the reference distribution R (an association of probabilities with TSG parses of sentences, such that the probabilities reflect parse preferences) using the following process:  1. Extract some sample ΩT (using the approach mentioned in section 3).</Paragraph>
    <Paragraph position="12"> 2. For each sentence in the sample, for each parse of that sentence, compute the 'distance' between the TSG parse and the WSJ reference parse. In our approach, distance is calculated in terms of a weighted sum of crossing rates, recall and precision. Minimising it maximises our definition of parse plausibility.4 However, there is nothing inherently crucial about this decision. Any other objective function (that can be represented as an exponential distribution) could be used instead.</Paragraph>
    <Paragraph position="13"> 3. Normalise the distances, such that for some sentence, the sum of the distances of all recovered TSG parses for that sentence is a constant across all sentences. Normalising in this manner ensures that each sentence is equiprobable (remember that RFM probabilities are in terms of parse preferences, and not probability of occurrence in some corpus).</Paragraph>
    <Paragraph position="14"> 4. Map the normalised distances into probabilities. If d(p) is the normalised distance of TSG parse p, then associate with parse p the reference probability given by the maximum likelihood estimator.</Paragraph>
    <Paragraph position="16"> Our approach therefore gives partial credit (a non-zero reference probability) to all parses in Ω. R is therefore not as discontinuous as the equivalent distribution used by Johnson et al. We therefore do not need to use simulated annealing or other numerically intensive techniques to estimate models.</Paragraph>
    <Paragraph position="17"> 4 Our distance metric is the same one used by Hektoen (Hektoen, 1997).</Paragraph>
  </Section>
</Paper>