<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1707">
  <Title>Annotating the Propositions in the Penn Chinese Treebank</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Linguistically interpreted corpora are instrumental in supervised machine learning paradigms of natural language processing. The information encoded in the corpora to a large extent determines what can be learned by supervised machine learning systems.</Paragraph>
    <Paragraph position="1"> Therefore, it is crucial to encode the desired level of information for its automatic acquisition. The creation of the Penn English Treebank (Marcus et al., 1993), a syntactically interpreted corpus, played a crucial role in the advances in natural language parsing technology (Collins, 1997; Collins, 2000; Charniak, 2000) for English. The creation of the Penn Chinese Treebank (Xia et al., 2000) is also beginning to help advance technologies in Chinese syntactic analysis (Chiang, 2000; Bikel and Chiang, 2000). Since the treebanks are generally syntactically oriented (cf. Sinica Treebank (Chen et al., to appear)), the information encoded there is &amp;quot;shallow&amp;quot;. Important information useful for natural language applications is missing. Most notably, significant regularities in the predicate-argument structure of lexical items are not captured. Recent effort in semantic annotation, the creation of the Penn Proposition Bank (Kingsbury and Palmer, 2002) on top of the Penn English Treebank is beginning to address this issue for English. In this new layer of annotation, the regularities of the predicates, mostly verbs, are captured in the predicate-argument structure. For example, in the sentences &amp;quot;The Congress passed the bill&amp;quot; and &amp;quot;The bill passed&amp;quot;, it is intuitively clear that &amp;quot;the bill&amp;quot; plays the same role in the two occurrences of the verb &amp;quot;pass&amp;quot;. Similar regularities also exist in Chinese. For example, in &amp;quot; /this /CL /bill /pass /AS&amp;quot; and &amp;quot; /Congress /pass /AS /this /CL /bill&amp;quot;, &amp;quot; /bill&amp;quot; also plays the same role for the verb &amp;quot; /pass&amp;quot; even though it occurs in different syntactic positions (subject and object respectively).</Paragraph>
    <Paragraph position="2"> Capturing such lexical regularities requires a &amp;quot;deeper&amp;quot; level of annotation than generally provided in a typical syntactically oriented treebank. It also requires making sense distinctions at the appropriate granularity. For example, the regularities demonstrated for &amp;quot;pass&amp;quot; does not exist in other senses of this verb. For example, in &amp;quot;He passed the exam&amp;quot; and &amp;quot;He passed&amp;quot;, the object &amp;quot;the exam&amp;quot; of the transitive use of &amp;quot;pass&amp;quot; does not play the same role as the subject &amp;quot;he&amp;quot; of the intransitive use. In fact, the subject plays the same role in both sentences.</Paragraph>
    <Paragraph position="3"> However, how deep the annotation can go is constrained by two important factors: how consistently human annotators can implement this type of annotation (the consistency issue) and whether the annotated information is learnable by machine (the learnability issue). Making fine-grained sense distinctions, in particular, has been known to be difficult for human annotators as well as machine-learning systems (Palmer et al., submitted). It seems generally true that structural information is more learnable than non-structural information, as evidenced by the higher parsing accuracy and relatively poor fine-grained WSD accuracy. With this in mind, we will propose a level of semantic annotation that still can be captured in structural terms and add this level of annotation to the Penn Chinese Treebank.</Paragraph>
    <Paragraph position="4"> The rest of the paper is organized as follows. In Section 2, we will discuss the annotation model in detail and describe our representation scheme. We will discuss some complications in Section 3 and some implementation issues in Section 4. Possible applications of this resource are discussed in Section 5.</Paragraph>
    <Paragraph position="5"> We will conclude in Section 6.</Paragraph>
  </Section>
class="xml-element"></Paper>