<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1003">
  <Title>Three Generative, Lexicalised Models for Statistical Parsing</Title>
  <Section position="6" start_page="20" end_page="22" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The parser was trained on sections 02 - 21 of the Wall Street Journal portion of the Penn Treebank (Marcus et al. 93) (approximately 40,000 sentences), and tested on section 23 (2,416 sentences). We use the PARSEVAL measures (Black et al. 91) to compare performance: Labeled Precision = (number of correct constituents in proposed parse) / (number of constituents in proposed parse). 0 CBs and &lt;= 2 CBs are the percentages of sentences with 0 or at most 2 crossing brackets respectively.</Paragraph>
    <Paragraph position="1"> Labeled Recall = (number of correct constituents in proposed parse) / (number of constituents in treebank parse). Crossing Brackets (CBs) = number per sentence of constituents which violate constituent boundaries with a constituent in the treebank parse.</Paragraph>
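As a rough illustration of how these measures could be computed, the following Python sketch scores a single sentence from (label, start, end) constituent tuples; it is not the evaluation code used in the paper, and the function and variable names are our own.

```python
# A minimal sketch (not the paper's evaluation code) of the PARSEVAL measures
# defined above. A constituent is a (label, start, end) tuple over word
# indices (end = index of the last word in the span), with punctuation
# assumed to be stripped already.
from collections import Counter

def parseval_scores(proposed, gold):
    """proposed, gold: lists of (label, start, end) tuples for one sentence."""
    # Labeled precision/recall: a proposed constituent is correct if the same
    # labelled span occurs in the treebank parse (multisets avoid double
    # counting of duplicate spans).
    correct = sum((Counter(proposed) & Counter(gold)).values())
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0

    # Crossing brackets: proposed constituents whose span partially overlaps
    # some treebank constituent's span (neither contains the other).
    def crosses(a, b):
        (_, s1, e1), (_, s2, e2) = a, b
        return s1 < s2 <= e1 < e2 or s2 < s1 <= e2 < e1

    crossing = sum(1 for p in proposed if any(crosses(p, g) for g in gold))
    return precision, recall, crossing
```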
    <Paragraph position="2"> For a constituent to be 'correct' it must span the same set of words (ignoring punctuation, i.e. all tokens tagged as commas, colons or quotes) and have the same label [8] as a constituent in the treebank parse. Table 2 shows the results for Models 1, 2 and 3. The precision/recall of the traces found by Model 3 was 93.3%/90.1% (out of 436 cases in section 23 of the treebank), where three criteria must be met for a trace to be "correct": (1) it must be an argument to the correct head-word; (2) it must be in the correct position in relation to that head word (preceding or following); (3) it must be dominated by the correct non-terminal label. For example, in figure 5 the trace is an argument to bought, which it follows, and it is dominated by a VP. Of the 436 cases, 342 were string-vacuous extraction from subject position, recovered with 97.1%/98.2% precision/recall; and 94 were longer distance cases, recovered with 76%/60.6% precision/recall. [9]</Paragraph>
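The three trace-correctness criteria can be stated compactly in code. The sketch below is purely illustrative: the Trace representation and its field names are assumptions, not structures from the paper.

```python
# A purely illustrative encoding (not from the paper) of the three criteria
# above for scoring a recovered trace; the Trace fields are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    head_word: str          # (1) the head-word the trace is an argument of
    follows_head: bool      # (2) True if the trace follows that head-word
    dominating_label: str   # (3) the non-terminal dominating the trace

def trace_correct(proposed: Trace, gold: Trace) -> bool:
    return (proposed.head_word == gold.head_word
            and proposed.follows_head == gold.follows_head
            and proposed.dominating_label == gold.dominating_label)
```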
    <Section position="1" start_page="21" end_page="22" type="sub_section">
      <SectionTitle>
4.1 Comparison to previous work
</SectionTitle>
      <Paragraph position="0"> Model 1 is similar in structure to (Collins 96) -- the major differences being that the "score" for each bigram dependency is P_l(L_i, l_i | H, P, h, distance_l) rather than P_2(L_i, P, H | l_i, h, distance_l), and that there are the additional probabilities of generating the head and the STOP symbols for each constituent. However, Model 1 has some advantages which may account for the improved performance.</Paragraph>
      <Paragraph position="1"> [Footnote 8: (Magerman 95) collapses ADVP and PRT to the same label; for comparison we also removed this distinction when calculating scores.] [Footnote 9: We exclude infinitival relative clauses from these figures, for example "I called a plumber TRACE to fix the sink", where 'plumber' is co-indexed with the trace subject of the infinitival. The algorithm scored 41%/18% precision/recall on the 60 such cases in section 23 -- but infinitival relatives are extremely difficult even for human annotators to distinguish from purpose clauses (in this case, the infinitival could be a purpose clause modifying 'called') (Ann Taylor, p.c.).]</Paragraph>
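To make the contrast concrete, the following sketch scores one constituent under a Model 1-style decomposition, generating each modifier non-terminal and head-word jointly given the head and parent, with STOP symbols closing each side. The estimate() and distance() helpers are placeholders; the paper's actual conditioning and smoothing are not reproduced here.

```python
# A sketch of the generative, head-centered scoring described above, under
# assumed helpers: estimate(event, context) is some smoothed probability
# estimator and distance(side, i) stands in for the distance features.
import math

STOP = ("+STOP+", None)

def log_prob_constituent(parent, head_nt, head_word,
                         left_mods, right_mods, estimate, distance):
    """left_mods / right_mods: (non-terminal, head-word) pairs, ordered
    outward from the head child."""
    # Probability of generating the head child's non-terminal.
    lp = math.log(estimate(("head", head_nt), (parent, head_word)))
    for side, mods in (("L", left_mods), ("R", right_mods)):
        for i, (L_i, l_i) in enumerate(list(mods) + [STOP]):
            # Model 1-style term: P(L_i, l_i | H, P, h, distance) -- the
            # modifier non-terminal and its head-word are generated jointly,
            # and a STOP symbol terminates each side.
            context = (head_nt, parent, head_word, distance(side, i))
            lp += math.log(estimate((L_i, l_i), context))
    return lp
```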
      <Paragraph position="2"> The model in (Collins 96) is deficient, that is, for most sentences S the total probability over parses, Σ_T P(T | S), is &lt; 1, because probability mass is lost to dependency structures which violate the hard constraint that no links may cross.</Paragraph>
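A toy enumeration can make the deficiency argument explicit: if every word independently chooses a head, the per-link probabilities sum to 1 over all head assignments, but discarding the assignments with crossing links leaves less than 1. The numbers below are invented for illustration, and other well-formedness constraints (cycles, multiple roots) are ignored, which would only remove further mass.

```python
# A toy illustration (not from the paper): each word independently picks a
# head, so the per-link model sums to 1 over all head assignments, but only
# the non-crossing structures are legal parses.
from itertools import product

words = [1, 2, 3]              # word positions; 0 is the root
head_prob = {                  # P(head | word), made-up numbers
    1: {0: 0.5, 2: 0.3, 3: 0.2},
    2: {0: 0.4, 1: 0.3, 3: 0.3},
    3: {0: 0.3, 1: 0.3, 2: 0.4},
}

def has_crossing(links):
    # links: (dependent, head) pairs; two links cross when their spans
    # partially overlap, e.g. [0,2] and [1,3].
    spans = [(min(d, h), max(d, h)) for d, h in links]
    return any(a < c < b < d or c < a < d < b
               for a, b in spans for c, d in spans)

total = legal = 0.0
for heads in product(*(head_prob[w] for w in words)):
    p = 1.0
    for w, h in zip(words, heads):
        p *= head_prob[w][h]
    total += p
    if not has_crossing(list(zip(words, heads))):
        legal += p

print(total)  # 1.0 (up to rounding): normalised over all head assignments
print(legal)  # < 1.0: mass on crossing structures is lost
```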
      <Paragraph position="3"> For reasons we do not have space to describe here, Model 1 has advantages in its treatment of unary rules and the distance measure. The generative model can condition on any structure that has been previously generated -- we exploit this in models 2 and 3 -- whereas (Collins 96) is restricted to conditioning on features of the surface string alone.</Paragraph>
      <Paragraph position="4"> (Charniak 95) also uses a lexicalised generative model. In our notation, he decomposes P(RHS_i | LHS_i) as P(R_n...R_1 H L_1...L_m | P, h) x Π_{i=1..n} P(r_i | P, R_i, h) x Π_{i=1..m} P(l_i | P, L_i, h). The Penn treebank annotation style leads to a very large number of context-free rules, so that directly estimating P(R_n...R_1 H L_1...L_m | P, h) may lead to sparse data problems, or problems with coverage (a rule which has never been seen in training may be required for a test data sentence). The complement/adjunct distinction and traces increase the number of rules, compounding this problem.</Paragraph>
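The coverage issue can be seen in a small counting sketch: a whole-rule estimate assigns zero probability to any RHS unseen in training, while a head-centered, modifier-by-modifier factorisation only needs counts for individual attachments. The sketch below omits the lexical conditioning, distance features and smoothing that the actual models use; all names are illustrative.

```python
# A counting sketch of the coverage contrast above. Counts would be filled
# in from training trees; everything here is illustrative.
from collections import Counter

rule_counts = Counter()          # (parent, children_tuple) -> count
parent_counts = Counter()        # parent -> count
mod_counts = Counter()           # (parent, head_nt, modifier, side) -> count
mod_ctx_counts = Counter()       # (parent, head_nt, side) -> count

def p_rule_direct(parent, children):
    """Whole-rule estimate: zero for any RHS never seen in training."""
    if parent_counts[parent] == 0:
        return 0.0
    return rule_counts[(parent, tuple(children))] / parent_counts[parent]

def p_rule_markov(parent, head_nt, left_mods, right_mods):
    """Head-centered, modifier-by-modifier estimate: an unseen rule can still
    get probability as long as its individual attachments were seen."""
    p = 1.0
    for side, mods in (("L", left_mods), ("R", right_mods)):
        for m in list(mods) + ["+STOP+"]:
            ctx = (parent, head_nt, side)
            if mod_ctx_counts[ctx] == 0:
                return 0.0
            p *= mod_counts[(parent, head_nt, m, side)] / mod_ctx_counts[ctx]
    return p
```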
      <Paragraph position="5"> (Eisner 96) proposes three dependency models, and gives results showing that a generative model similar to Model 1 performs best of the three. However, a pure dependency model omits non-terminal information, which is important. For example, "hope" is likely to generate a VP(TO) modifier (e.g., I hope [VP to sleep]) whereas "require" is likely to generate an S(TO) modifier (e.g., I require [S Jim to sleep]), but omitting non-terminals conflates these two cases, giving high probability to incorrect structures such as "I hope [Jim to sleep]" or "I require [to sleep]". (Alshawi 96) extends a generative dependency model to include an additional state variable which is equivalent to having non-terminals -- his suggestions may be close to our Models 1 and 2, but he does not fully specify the details of his model, and does not give results for parsing accuracy. (Miller et al. 96) describe a model where the RHS of a rule is generated by a Markov process, although the process is not head-centered. They increase the set of non-terminals by adding semantic labels rather than by adding lexical head-words.</Paragraph>
      <Paragraph position="6"> (Magerman 95; Jelinek et al. 94) describe a history-based approach which uses decision trees to estimate P(T | S). Our models use much less sophisticated n-gram estimation methods, and might well benefit from methods such as decision-tree estimation which could condition on richer history than just surface distance.</Paragraph>
      <Paragraph position="7"> There has recently been interest in using dependency-based parsing models in speech recognition, for example (Stolcke 96). It is interesting to note that Models 1, 2 or 3 could be used as language models. The probability for any sentence can be estimated as P(S) = Σ_T P(T, S), or (making a Viterbi approximation for efficiency reasons) as P(S) ≈ P(T_best, S). We intend to perform experiments to compare the perplexity of the various models, and a structurally similar 'pure' PCFG. [10]</Paragraph>
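For instance, per-word perplexity under either estimate could be computed as in the sketch below, where parse_probs(sentence) is an assumed interface returning the P(T, S) values for a sentence's candidate parses; nothing here is from the paper's implementation.

```python
# A sketch of per-word perplexity under the two estimates above: the full
# sum over parses, or the single best (Viterbi) parse.
import math

def sentence_logprob(sentence, parse_probs, viterbi=False):
    probs = parse_probs(sentence)                 # list of P(T, S)
    p = max(probs) if viterbi else sum(probs)     # Viterbi vs. full sum
    return math.log2(p)

def perplexity(sentences, parse_probs, viterbi=False):
    total_lp, total_words = 0.0, 0
    for s in sentences:
        total_lp += sentence_logprob(s, parse_probs, viterbi)
        total_words += len(s)
    # 2 to the negative average log2 probability per word.
    return 2.0 ** (-total_lp / total_words)
```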
    </Section>
  </Section>
class="xml-element"></Paper>