<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1025">
  <Title>HEURISTICS FOR BROAD-COVERAGE NATURAL LANGUAGE PARSING</Title>
  <Section position="7" start_page="130" end_page="131" type="concl">
    <SectionTitle>
5. TESTS OF ESG
</SectionTitle>
    <Paragraph position="0"> Three recent tests of ESG coverage will be described, two on computer manual text and one on Wall Street Journal (WSJ) text. In all of the tests, there were no restrictions placed on vocabulary or length of test segments. Only the first parse given by ESG for each segment was considered. 4 For each segment, parse output was rated with one of three categories- p: perfect parse, pa: approximate parse, or bad: not p or pa. To get a p rating, all of the SG structure had to be correct, including for example slot labels; so this is a stricter requirement than just getting surface structure or bracketing correct. An approximate parse is a non-perfect one for which nevertheless all the feature structures are correct and surface structureis correct except for level of attachment of modifiers.</Paragraph>
    <Paragraph position="1"> In MT applications, one can often get reasonable translations using approximate parses.</Paragraph>
    <Paragraph position="2"> This way of rating parses is not an ideal one, because a parse for a very long sentence can be rated bad even when it has a single word with a wrong feature or slot. A combination of measures of partial success, such as those obtained by counting bracketing crossings, would be reasonable, since partially correct parses may still be useful. I can make up for this partially by reporting results as a function of segment length.</Paragraph>
    <Paragraph position="3"> Test 1 This was done using a set of approximately 88,000 segments from computer manuals on which no training of ESG had been done. Half of the corpus, simply consisting of the odd-numbered segments, was used for some lexical training. Slava Katz's terminology identification program \[6\] was run on this portion as well as a program that finds candidate terms by looking (roughly) for sequences of capitalized words. About one day was spent editing this auxiliary multi-word lexicon; theedited result consisted of 2176 entries. Then 100 segments were selected (automatically) at random from the (blind) even-numbered segments. The segments ranged in token list length from 2 to 38. The following table shows rating percentages for the segments of token list length &lt; N for selected _h r.</Paragraph>
    <Paragraph position="4">  best, only the first one output by the system was used.</Paragraph>
    <Paragraph position="5">  Test 2 From a set of about 2200 computer manual segments, 20% had been selected automatically at random, removed, and kept as a blind test set, and some ESG grammatical and lexicaAt work had been done on the remaining. The test was on 100 of the blind test sentences, which happened to have the same range in token list length, 2 to 38, as in the preceding test. The following table, similar in form to the preceding, shows results.</Paragraph>
    <Paragraph position="6">  Test 3 This used a corpus of over 4 million segments from the WSJ. No attempt was made to isolate a blind test set.</Paragraph>
    <Paragraph position="7"> However, little work on ESG has been done for WSJ text maybe looking at a total of 500 sentences over the span of work on ESG, with most of these obtained in other ways (I do not know if they were in the corpus in question). At any rate, automatic random choice from the 4M-segment corpus presumably resulted in segments that ESG had never seen in its life.</Paragraph>
    <Paragraph position="8"> Prior to selection of the test set, Katz's terminology identification was run on approximately 40% of the corpus. A portion of the results (based on frequency) underwent about a day's worth of editing, giving an auxiliary multiword lexicon with 1513 entries.</Paragraph>
    <Paragraph position="9"> Then 100 segments were selected at random from the 4M-segment corpus. They ranged in token list length from 6 to 57. ESG was run, with the following results, shown again as percentages for segments of length &lt; N: N %p %p orpa  ESG delivered some kind of analysis for all of the segments in the three tests, with about 11% fitted parses for the computer manual texts, and 26% fitted parses for the WSJ. The average parse time per segment was 1.5 seconds for the computer manuals and 5.6 seconds for the WSJ- on an IBM mainframe with a Prolog interpreter (not compiled).</Paragraph>
  </Section>
class="xml-element"></Paper>