<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1065"> <Title>A Statistical Parser for Czech*</Title> <Section position="3" start_page="505" end_page="505" type="intro"> <SectionTitle> 2 Data and Evaluation </SectionTitle> <Paragraph position="0"> The Prague Dependency Treebank (PDT) (Hajič, 1998) has been modeled after the Penn Treebank (Marcus et al., 1993), with one important exception: following the Praguian linguistic tradition, the syntactic annotation is based on dependencies rather than phrase structures. Thus, instead of the &quot;nonterminal&quot; symbols used at the non-leaves of a phrase-structure tree, the PDT uses so-called analytical functions, which capture the type of relation between a dependent and its governing node. Consequently, the number of nodes equals the number of tokens (words + punctuation) plus one (an artificial root node with a rather technical function is added to each sentence). The PDT also contains a traditional morpho-syntactic annotation (tags) at each word position (together with a lemma, uniquely representing the underlying lexical unit). As Czech is a highly inflected (HI) language, the set of possible tags is unusually large: more than 3,000 tags may be assigned by the Czech morphological analyzer. The PDT also contains machine-assigned tags and lemmas for each word (using a tagger described in (Hajič and Hladká, 1998)).</Paragraph> <Paragraph position="1"> For evaluation purposes, the PDT has been divided into a training set (19k sentences) and a development/evaluation test set pair (about 3,500 sentences each). Parsing accuracy is defined as the ratio of correct dependency links to the total number of dependency links in a sentence (which, with the one artificial root node added, equals the number of tokens in the sentence). As usual, while the development test set was available during the development phase, all final results have been obtained on the evaluation test set, which nobody could see beforehand.</Paragraph> </Section> </Paper>