<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0210"> <Title>Discourse Annotation and Semantic Annotation in the GNOME Corpus</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 The Data </SectionTitle> <Paragraph position="0"> Texts from three domains were (partially) annotated. The museum subcorpus consists of descriptions of museum objects and brief texts about the artists who produced them. The pharmaceutical subcorpus is a selection of leaflets providing patients with legally mandatory information about their medicine. The GNOME corpus also includes tutorial dialogues from the Sherlock corpus collected at the University of Pittsburgh. Each subcorpus contains about 6,000 NPs, but not all types of annotation have been completed for all domains.</Paragraph> <Paragraph position="1"> All sentences, units and NPs have been identified, as have all 'syntactic' properties of NPs (agreement features and grammatical function). Anaphoric relations have been annotated in about half of the texts in each domain, and the more complex semantic properties (taxonomic properties, genericity, etc.) in about 25% of these texts. The total size of the annotated corpus is about 60K.</Paragraph> </Section> </Paper>