<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1064">
  <Title>Creating a CCGbank and a wide-coverage CCG lexicon for German</Title>
  <Section position="7" start_page="509" end_page="510" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> Translation coverage The algorithm can fail at several stages. If the graph cannot be turned into a tree, it cannot be translated. This happens in 1.3% (647) of all sentences. In many cases, this is due to coordinated NPs or PPs where one or more conjuncts are extraposed. We believe that these are anaphoric, and further preprocessing could take care of this. In other cases, this is due to verb topicalization (gegeben hat Peter Maria das Buch), which our algorithm cannot currently deal with.</Paragraph>
    <Paragraph position="1"> For 1.9% of the sentences, the algorithm cannot obtain a correct CCG derivation. Mostly this is the case because some traces and extraposed elements cannot be discharged properly. Typically this happens either in local scrambling, where an object of the main verb appears between the auxiliary and the subject (hat das Buch Peter...), or when an argument of a noun that appears in a relative clause is extraposed to the right. There are also a small number of constituents whose head is not annotated. We ignore any gapping construction or argument cluster coordination that we cannot get into the right shape (1.5%, 732 sentences). There are also a number of other constructions that we do not currently deal with. We do not process sentences if the root of the graph is a &quot;virtual root&quot; that does not expand into a sentence (1.7%, 869; this is mostly the case for strings such as Frankfurt (Reuters)), or if we cannot identify a head child of the root node (1.3%, 648; mostly fragments or elliptical constructions).</Paragraph>
    <Paragraph position="2"> Overall, we obtain CCG derivations for 92.4% (46,628) of all 50,474 sentences, including 88.4% (12,122) of those whose Tiger graphs are marked as discontinuous (13,717), and 95.2% of all 48,957 full sentences (excluding headless roots and fragments, but counting coordinate structures such as gapping).</Paragraph>
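The percentages reported above can be re-derived from the raw counts. The following is a minimal sketch in Python, not part of the original evaluation: the corpus total of 50,474 is an assumption taken from the ten 5,000-sentence segments plus the remaining 474 sentences described under lexical coverage, and the dictionary labels are paraphrases introduced here for illustration.

```python
# Counts are the ones reported in the text. The total is an assumption:
# ten segments of 5,000 sentences plus the 474 sentences left over.
TOTAL_SENTENCES = 10 * 5000 + 474

# Labels are illustrative paraphrases, not terms defined by the paper.
reported_counts = {
    "graph cannot be turned into a tree": 647,
    "gapping or argument cluster coordination ignored": 732,
    "virtual root that does not expand into a sentence": 869,
    "no head child of the root node": 648,
    "CCG derivation obtained": 46628,
}


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {label: percent(n, TOTAL_SENTENCES)
          for label, n in reported_counts.items()}

# Discontinuous Tiger graphs: 12,122 translated out of 13,717.
discontinuous_share = percent(12122, 13717)

for label, share in shares.items():
    print(f"{label}: {share}%")
print(f"discontinuous graphs translated: {discontinuous_share}%")
```

Under this assumed total, each recomputed share agrees with the figure given in the text to one decimal place.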
    <Paragraph position="3"> Lexicon size There are 2,506 lexical category types, but 1,018 of these appear only once. 933 category types appear more than 5 times.</Paragraph>
    <Paragraph position="4"> Lexical coverage In order to evaluate coverage of the extracted lexicon on unseen data, we split the corpus into segments of 5,000 sentences (ignoring the last 474), and perform 10-fold cross-validation, using 9 segments to extract a lexicon and the 10th to test its coverage. Average coverage is 86.7% (by token) of all lexical categories. Coverage varies between 84.4% and 87.6%. On average, 92% (90.3%-92.6%) of the lexical tokens that appear in the held-out data also appear in the training data. On these seen tokens, coverage is 94.2% (93.5%-92.6%). More than half of all missing lexical entries are nouns. (Footnote on verb topicalization: The corresponding CCG derivation combines the remnant complements as in argument cluster coordination. Footnote on local scrambling: This problem arises because Tiger annotates subjects as arguments of the auxiliary. We believe this problem could be avoided if they were instead arguments of the non-finite verb.)</Paragraph>
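The cross-validation procedure described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the author's implementation: the `Token` record, the function names, and the toy corpus are all hypothetical, and a token is counted as covered when its word has been seen in training with the same lexical category. Under that definition, overall coverage is the product of the share of seen tokens and the coverage on seen tokens, which matches the reported averages (0.92 × 0.942 ≈ 0.867).

```python
from collections import namedtuple

# Hypothetical token record: a word form paired with its lexical category.
Token = namedtuple("Token", ["word", "category"])


def split_into_segments(sentences, segment_size, n_segments):
    """Cut the corpus into n_segments equal segments, ignoring any remainder."""
    return [sentences[i * segment_size:(i + 1) * segment_size]
            for i in range(n_segments)]


def extract_lexicon(sentences):
    """Map each word to the set of categories it occurs with."""
    lexicon = {}
    for sentence in sentences:
        for token in sentence:
            lexicon.setdefault(token.word, set()).add(token.category)
    return lexicon


def fold_coverage(train, test):
    """Return (coverage on all tokens, share of seen tokens, coverage on seen tokens)."""
    lexicon = extract_lexicon(train)
    total = seen = covered = 0
    for sentence in test:
        for token in sentence:
            total += 1
            if token.word in lexicon:
                seen += 1
                if token.category in lexicon[token.word]:
                    covered += 1
    overall = covered / total if total else 0.0
    seen_share = seen / total if total else 0.0
    seen_coverage = covered / seen if seen else 0.0
    return overall, seen_share, seen_coverage


def cross_validate(sentences, segment_size, n_segments):
    """Hold out each segment in turn and extract the lexicon from the rest."""
    segments = split_into_segments(sentences, segment_size, n_segments)
    results = []
    for held_out in range(n_segments):
        train = [sentence
                 for i, segment in enumerate(segments) if i != held_out
                 for sentence in segment]
        results.append(fold_coverage(train, segments[held_out]))
    return results


# Toy corpus (invented data): five sentences, two folds of two sentences each,
# so the fifth sentence is ignored, as the last 474 sentences are in the text.
toy_corpus = [
    [Token("Peter", "NP"), Token("liest", "V")],
    [Token("Maria", "NP"), Token("liest", "V")],
    [Token("Peter", "NP"), Token("gibt", "V")],
    [Token("Maria", "N"), Token("liest", "V")],
    [Token("Buch", "N")],
]
toy_results = cross_validate(toy_corpus, segment_size=2, n_segments=2)
```

For the setup in the text, the call would be `cross_validate(corpus, segment_size=5000, n_segments=10)`, with the three values averaged over the ten folds.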
    <Paragraph position="5"> In the English CCGbank, a lexicon extracted from sections 02-21 (930,000 tokens) has 94% coverage on all tokens in section 00, and 97.7% coverage on all seen tokens (Hockenmaier and Steedman, 2005). In the English data set, the proportion of seen tokens (96.2%) is much higher, most likely because of the relative lack of derivational and inflectional morphology. The better lexical coverage on seen tokens is also to be expected, given that the flexible word order of German requires case markings on all nouns as well as at least two different categories for each tensed verb, and more in order to account for local scrambling.</Paragraph>
  </Section>
</Paper>