<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1145"> <Title>Building a Large-Scale Annotated Chinese Corpus</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 3 &lt;www.cis.upenn.edu/~brandall&gt; </SectionTitle> <Paragraph position="0"> grammatical relations as the most basic, namely, complementation, adjunction and coordination.</Paragraph> <Paragraph position="1"> Each of these three grammatical relations is assigned a unique structure represented schematically as follows: Besides the hierarchical representations, functional tags are used to mark additional information. These functional tags can be regarded as secondary and are used to complement the hierarchical representations. For example, in Chinese, multiple noun phrases (labeled NP in the Chinese Treebank) can occur before the verb within a clause (or above the verb if seen hierarchically). Structurally, they are all above the verb. Therefore, they are further differentiated by secondary functional tags. Generally, an NP marked -SBJ (subject) is required. There can optionally be topics (marked by -TPC) and adjuncts (marked by -ADV, -TMP, etc.).</Paragraph> <Paragraph position="2"> 'In the 1990s, Haier Group is highly recognized both domestically and overseas.' Similarly, multiple NPs can also occur after the verb and they can be marked as -OBJ (for object) or -EXT (basically a cover term for all other phrases that are not marked -OBJ). This representational scheme allows the identification of such basic grammatical relations as subject, object and adjuncts in the corpus, which can be used to train syntactic parsers. 
However, as we will discuss in the next section, it is not enough for other CLP tasks that require deeper annotation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Future Annotation </SectionTitle> <Paragraph position="0"> The annotations provided during the bracketing phase may be enough for training syntactic parsers, but they are not sufficient for other CLP tools and applications. There are at least two areas in which the Chinese Treebank can be enhanced, namely, more fine-grained predicate-argument structure annotation and coreference annotation.</Paragraph> <Paragraph position="1"> As we have discussed above, one pre-verb noun phrase is marked as subject with the -SBJ tag and one post-verb noun phrase can be marked as -OBJ. However, the subject and object in the Chinese Treebank are defined primarily in structural terms. The semantic relation between the subject and the verb is not uniform across all verbs, or even for different instances of the same verb. The same is true for the relation between the object and the verb. For some verbs, there are systematic alternations between the subject and the object, with the same NP occurring in the subject position in one sentence but in the object position in another, while the thematic role it assumes remains constant.</Paragraph> <Paragraph position="2"> In 11, the noun phrase glossed as &quot;New Year reception&quot; is the subject in 11a while it is the object in 11b.</Paragraph> <Paragraph position="3"> However, in both cases, it is the theme. This may be problematic for some tools and applications. For an information extraction task, for example, if one wants to find all events held at a hotel, it is not enough to just look for the object in the parse tree; one also needs to know what thematic role the noun phrase assumes.</Paragraph> <Paragraph position="4"> One might also want to extract information from sentences with pronouns. 
We believe predicate-argument structure annotation and coreference annotation will be useful enhancements to this corpus and we will explore these possibilities.</Paragraph> <Paragraph position="5"> Summary In this paper we have shown that the use of annotation tools, not only for segmentation and POS tagging, but also for syntactic bracketing, can speed up the annotation process. We have also discussed how to ensure the quality of the corpus. We believe these methods are generalizable to the development of corpora in other languages.</Paragraph> </Section> </Section> </Paper>