<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1051">
  <Title>Committee-based Decision Making in Probabilistic Partial Parsing</Title>
  <Section position="4" start_page="348" end_page="349" type="metho">
    <SectionTitle>
2 Probabilistic partial parsing
</SectionTitle>
    <Section position="1" start_page="348" end_page="348" type="sub_section">
      <SectionTitle>
2.1 Dependency probability
</SectionTitle>
      <Paragraph position="0"> In this t)at)er, we consider the task of (le(:iding the det)endency structure of a Jat);mese input sentence. Note that, while we restrict ore: discussion to analysis of Jat)anese senl;(;nc(;s in this t)~l)er, what we present l)elow should also t)e strnightfi?rwardly ?xt)plical)h~ to more wideranged tasks such as English det)endency analysis just like the t)roblem setting considered t)y Collins (1996).</Paragraph>
      <Paragraph position="1"> Givell ;m inl)ut sentence ,s as a sequence, of B'unset,su-t)hrases (BPs) J, lq b2 ... lh~, our task is to i(tent, i\[y their inter-BP del)endency struct,,e n = l,j)l,: = ',,,}, where (tenot;es that bi (let)on(Is on (or modities) bj.</Paragraph>
      <Paragraph position="2"> Let us consider a dependency p'roba, bility (I)P): P('r(bi, bj)l.s'), a t)rol)al)ility l;lu~t 'r(bi, b:j) hohts in a Given senl:ence s: Vi. Ej P(','(51, t,j)l.4 = a.</Paragraph>
    </Section>
    <Section position="2" start_page="348" end_page="348" type="sub_section">
      <SectionTitle>
2.2 Estimation of DPs
</SectionTitle>
      <Paragraph position="0"> Some of the state-of:the-art 1)rol)at)ilis(;ic bmguage inodels such as the l)ottomu t) models P(l~,l.,.) propos,,d by Collins (1:)96) and Fujio et al. (1998) directly estimate DPs tbr :~ given int)ut , whereas other models su('h as PCFOt)ased tel)down generation mod(;ls P(H,,,s) do not, (Charnink, 1997; Collins, 1997; Shir~fi et ~rl., 1998). If the latter type of mod(,'ls were totally exchlded fronl any committee, our commit;tee-based framework would not work well in I)raclice. Fortm:ately, how(:ver, even tbr such a model, one can still estimate l)l?s in the following way if the rood(;1 provides the n-best del)en1A bunsctsu phrase (BP) is a chunk of words (-onsist;ing of a content word (noun, verl), adjective, etc.) accoml)mfied by sonic flmctional word(s) (i)arti(:le, mlxiliary, etc.). A .lai)anes(' sentc'nce can 1)c analyzed as a sequence of BPs, which constitutes an inter-BP deI)endency structure dency structure candidates cout)led with prot)abilistic scores.</Paragraph>
      <Paragraph position="1"> Let Ri be the i-th best del)endency st;ruct;ure (i = 1,..., 'n) of ;~ given input ,s' according to a given model, and h;t ~H l)e a set; of H,i. Then, ,.,u, l,e csl;ima|;ed by the following</Paragraph>
      <Paragraph position="3"> where P'R.u is the probal)ilit;y mass of H, E 7~Lr, and prn. is the probability mass of R ~ ~H that suppori;s 'r(bi, bj). Tile approximation error c is given 1)y c &lt; l;r~--1%, where l),p,, is 1;t2(; -- l~p~ ' prol)abilil;y mass of all the dependency structure candidates for s (see (Peele, 1993) for the l?roof). This means that the al)t)roximation error is negligil)le if P'R,, is sut\[iciently close to 1),R, which holds for a reasonably small mlmt)er 'n in lnOSt cases in practical statistical parsing.</Paragraph>
    </Section>
    <Section position="3" start_page="348" end_page="349" type="sub_section">
      <SectionTitle>
2.3 Coverage-accuracy curves
</SectionTitle>
      <Paragraph position="0"> We then conside, r the task of selecting dependency relations whose estimated probability is higher I:han a (:e|:i;ain l;hreshoht o- (0 &lt; a &lt; 1).</Paragraph>
      <Paragraph position="1"> When (r is set 1;o be higher (closer to 1.0), t;he accuracy is cxt)ected to become higher, while the coverage is ext)ecl;ed to become lowe,:, and vi(:e versm Here, (;over~ge C* and a,(;ctlra(;y A are defined as follows: # of the. decided relations C # of nil the re, lations in I;\]le t;est so,}i2 )/~ # of the COl'rectly decided relatidegn~3~vJ A # of the decided relations Moving the threshohl cr from 1.0 down toward 0.0, one (:an obtain a coverage-a(:cura(:y (:urve (C-A curve). In 1)rol)al)ilistic t)artial parsing, we ewflunte the t)erforman('e of a model ~mcording to its C-A curve. A few examt)les are shown in Figure 1, which were obtained in our ext)erim(mt (see Section 4). Ot)viously, Figure 1 shows that model A outt)erformed the or, her two. To summarize a C-A cIlrve, we use the ll-t)oint average of accuracy (l l-t)oint at:curacy, hereafl;er), where the eleven points m'e C = 0.5, 0.55,..., 1.0. The accuracy of total parsing correst)onds to the accuracy of the t)oint in a C-A curve where C = 1.0. We call it total ~ccuracy to distinguish it from l\]-l)oint at:el&gt; racy. Not;('. that two models with equal achieve- null meuts in total accuracy may be different in ll-point accuracy. In fact, we found such cases in our experiments reported below. Plotting C-A curves enable us to make a more fine-grained perfbrmance evaluation of a model.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="349" end_page="349" type="metho">
    <SectionTitle>
3 Committee-based probabilistic partial parsing
</SectionTitle>
    <Paragraph position="0"> tic partial parsing We consider a general scheme of comnfittee-based probabilistic partial parsing as illustrated in Figure 2. Here we assume that each connnittee member M~ (k = 1,..., m) provides a DP matrix PM~(r(bi, bj)ls ) (bi, bj E s) tbr each input 8. Those matrices are called inlmt matrices, and are give:: to the committee as its input.</Paragraph>
    <Paragraph position="1"> A committee consists of a set of weighting functions and a combination flmction. The role assigned to weighting flmctions is to standardize input matrices. The weighting function associated with model Mk transforms an input matrix given by MI~ to a weight matrix WaG- The majority flmction then combines all the given weight matrices to produce an output matrix O, which represents the final decision of the con&gt; mittee. One can consider various options for both flmctions.</Paragraph>
    <Section position="1" start_page="349" end_page="349" type="sub_section">
      <SectionTitle>
3.1 Weighting functions
</SectionTitle>
      <Paragraph position="0"> We have so far considered the following three options.</Paragraph>
      <Paragraph position="1"> Simple The simplest option is to do nothing:</Paragraph>
      <Paragraph position="3"> o Mk where wij is the (i,j) element of I/VMk.</Paragraph>
      <Paragraph position="4"> Normal A bare DP may not be a precise estimation of the actual accuracy. One can see this by plotting probability-accuracy curves (P-A curves) as shown in Figure 3. Figure 3 shows that model A tends to overestimate DPs, while</Paragraph>
      <Paragraph position="6"> models itlpllt i weight matrices i .................................. : matrices &amp;quot;tV l:: &amp;quot;~Vt~iglt|iJI g \]&amp;quot;u n C/1\]Oll</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="349" end_page="350" type="metho">
    <SectionTitle>
CF: Ct3mhinatlan I:mlcl\[.n
</SectionTitle>
    <Paragraph position="0"> model C tends to underestimate DPs. This lneans that if A and C give different answers with the same DP, C's answer is more likely to be correct. Thus, it is :sot necessarily a good strategy to simply use give:: bare DPs in weighted majority. To avoid this problem, we consider the tbllowing weighting flmction:</Paragraph>
    <Paragraph position="2"> where AMk (P) is the function that returns the expected accuracy of Mk's vote with its depen-Mk dency probability p, and oz i is a normalization factor. Such a function can be trained by plotting a P-A curve fbr training data. Note that training data should be shared by all the committee members. In practice, tbr training a P-A curve, some smoothing technique should be applied to avoid overfitting.</Paragraph>
    <Paragraph position="3"> Class The standardization process in the above option Normal can also be seen as an effort for reducing the averaged cross entropy of the model on test, data. Since P-A curves tend to defi~,r not only between different models but also between different problem classes, if one incorporates some problem classification into (5), the averaged cross entropy is expected  to be reduced fllrther:</Paragraph>
    <Paragraph position="5"> where AMkcl, i (P) is the P-A curve of model Mk only tbr the problems of class Cb~ in training data, and flMk is a normalization factor. For i probleln classification, syntactie/lexieal features of bi may be useful.</Paragraph>
    <Section position="1" start_page="350" end_page="350" type="sub_section">
      <SectionTitle>
3.2 Combining functions
</SectionTitle>
      <Paragraph position="0"> For combination flmctions, we have so far considered only simple weighted voting, which averages the given weight matrices:</Paragraph>
      <Paragraph position="2"> where o.i.f/~:_ is the (i, j) element of O.</Paragraph>
      <Paragraph position="3"> Note that the committee-based partial parsing frmnework t)resented here can be see, n as a generalization of the 1)reviously proposed voting-based techniques in the following respects: null  (a) A committee a(:(:epts probabilistically parameterized votes as its intmt.</Paragraph>
      <Paragraph position="4"> (d) A committee ac(:el)ts multil)le voting (i.e. it; allow a comnfittee menfl)er to vote not only to the 1)est-scored calMi(late trot also to all other potential candidates).</Paragraph>
      <Paragraph position="5"> ((:) A. (:ommittee 1)rovides a metals tbr standardizing original votes.</Paragraph>
      <Paragraph position="6"> (b) A committee outl)uts a 1)rot)abilisti(&amp;quot; distribution representing a tinal decision, which constitutes a C-A curve.</Paragraph>
      <Paragraph position="7">  For examt)le, none of simple voting techniques for word class tagging t)roposed 1)y van Halteren et al. (1998) does not accepts multiple voting. Henderson and Brill (1999) examined constituent voting and naive Bayes classifi(:alion for parsing, ol)taining positive results ibr each. Simple constituent voting, however, does not accept parametric votes. While Naive Bayes seems to partly accept l)arametric multit)le voting, it; does not consider either sl;andardization or coverage/accuracy trade-off.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML