<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1065">
  <Title>A Statistical Parser for Czech*</Title>
  <Section position="3" start_page="505" end_page="505" type="intro">
    <SectionTitle>
2 Data and Evaluation
</SectionTitle>
    <Paragraph position="0"> The Prague Dependency Treebank (PDT) (Hajič, 1998) has been modeled after the Penn Treebank (Marcus et al., 1993), with one important exception: following the Praguian linguistic tradition, the syntactic annotation is based on dependencies rather than phrase structures. Thus, instead of the &quot;nonterminal&quot; symbols used at the non-leaves of the tree, the PDT uses so-called analytical functions, which capture the type of relation between a dependent and its governing node. Consequently, the number of nodes is equal to the number of tokens (words + punctuation) plus one (an artificial root node with a rather technical function is added to each sentence). The PDT also contains a traditional morpho-syntactic annotation (tags) at each word position (together with a lemma, uniquely representing the underlying lexical unit). As Czech is a highly inflected (HI) language, the set of possible tags is unusually large: more than 3,000 tags may be assigned by the Czech morphological analyzer. The PDT also contains machine-assigned tags and lemmas for each word, produced by the tagger described in (Hajič and Hladká, 1998).</Paragraph>
    <Paragraph position="1"> For evaluation purposes, the PDT has been divided into a training set (19k sentences) and a development/evaluation test set pair (about 3,500 sentences each). Parsing accuracy is defined as the ratio of correct dependency links to the total number of dependency links in a sentence (which, with the one artificial root node added, equals the number of tokens in the sentence). As usual, the development test set was available during the development phase, and all final results have been obtained on the evaluation test set, which nobody could see beforehand.</Paragraph>
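The accuracy definition above can be illustrated with a minimal Python sketch. It is not the authors' evaluation script. It assumes that each sentence is stored as a list of head indices, one per token, with index 0 standing for the artificial root node, and that a link counts as correct when the predicted head equals the gold head (whether the analytical function must also match is not stated in this section). The function name attachment_accuracy and the example sentence are hypothetical.

```python
from typing import List

ROOT = 0  # index of the artificial root node (a convention assumed by this sketch)


def attachment_accuracy(gold_heads: List[int], predicted_heads: List[int]) -> float:
    """Return the share of tokens whose predicted head equals the gold head.

    Every token has exactly one head, so the number of dependency links
    equals the number of tokens, and the tree has len(gold_heads) + 1 nodes.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold_heads:
        raise ValueError("sentence has no tokens")
    correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
    return correct / len(gold_heads)


# Hypothetical four-token sentence: token i (1-based) has head gold[i - 1].
gold = [2, ROOT, 2, 2]
predicted = [2, ROOT, 4, 2]  # the third token is attached to the wrong head
accuracy = attachment_accuracy(gold, predicted)
print(accuracy)  # 3 of 4 links are correct, so this prints 0.75
```

Averaging over a test set could be done per sentence or over all tokens pooled together; the section does not say which, so the sketch stops at the single-sentence ratio.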
  </Section>
</Paper>