<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1040">
  <Title>A Common Framework for Syntactic Annotation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 http://www.cis.upenn.edu/treebank
</SectionTitle>
    <Paragraph position="0"> provide a constituency analysis4 but rather specify grammatical relations among elements explicitly; for example, the sentence Paul intends to leave IBM could be represented as shown in Figure 2, where the predicate is the relation type, the first argument is the head, the second the dependent, and additional arguments may provide category-specific information (e.g., introducer for prepositional phrases, etc.).</Paragraph>
    <Paragraph position="1">  into the front room, closing the door behind him.</Paragraph>
    <Paragraph position="3"/>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Model for Syntactic Annotation
</SectionTitle>
    <Paragraph position="0"> The goal in the XCES is to provide a framework for annotation that is theory and tagset independent. We accomplish this by treating the description of any specific syntactic annotation scheme as a process involving several knowledge sources that interact at various levels. The process allows one to specify, on the one hand, the informational properties of the scheme (i.e., its capacity to represent a given piece of information), and, on the other, the way the scheme can be instantiated (e.g., as an XML document). Figure 3 shows the overall architecture of the XCES framework for syntactic annotation.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 So-called hybrid systems (e.g., Basili, et al., 199;
</SectionTitle>
    <Paragraph position="0"> Grefenstette, 1999) combine constituency analysis and functional dependencies, usually producing a shallow constituent parse that brackets major phrase types and identifying the dependencies between heads of constituents.</Paragraph>
    <Paragraph position="1"> Figure 3. Overall architecture of the XCES annotation framework. Two knowledge sources are used to define the abstract model: Data Category Registry (DCR): Within the framework of the XCES we are establishing an inventory of data categories for syntactic annotation, initially based on the EAGLES Recommendations for Syntactic Annotation of Corpora (Leech et al., 1996). Data categories are defined using RDF descriptions that formalize the properties associated with each. The categories are organized in a hierarchy, from general to specific. For example, a general dependent relation may be defined, which may have one of the possible values argument or modifier; argument in turn may have the possible values subject, object, or complement; etc.5 Note that RDF descriptions function much like class definitions in an object-oriented programming language: they provide, in effect, templates that describe how objects may be instantiated, but do not constitute the objects themselves.</Paragraph>
    <Paragraph position="2"> Thus, in a document containing an actual annotation, several objects with the type argument may be instantiated, each with a different value. The RDF schema ensures that each instantiation of argument is recognized as a sub-class of dependent and inherits the appropriate properties.</Paragraph>
    <Paragraph position="3"> Structural Skeleton: a domain-dependent abstract structural framework for syntactic annotations, capable of fully capturing all the information in a specific annotation scheme. The structural skeleton for syntactic annotations is described below in Section 2.1.</Paragraph>
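    <Paragraph> The general-to-specific category hierarchy described for the Data Category Registry can be illustrated with a minimal Python sketch. It is not RDF and not part of the XCES: the dictionary PARENT and the helper names ancestors and is_a are assumptions made for illustration, and only the category names come from the example in the text.
<![CDATA[
```python
# Hypothetical in-memory stand-in for a data category hierarchy.
# Category names follow the example in the text; the dict layout and
# the function names are assumptions made for this sketch.
PARENT = {
    "argument": "dependent",
    "modifier": "dependent",
    "subject": "argument",
    "object": "argument",
    "complement": "argument",
}


def ancestors(category):
    """Return the chain of more general categories, nearest first."""
    chain = []
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def is_a(category, general):
    """True if `category` equals `general` or specializes it."""
    return category == general or general in ancestors(category)
```
]]>
    Under these assumptions, ancestors("subject") yields ["argument", "dependent"], which mirrors how an instance of a specific category is recognized as belonging to the more general ones.</Paragraph>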
    <Paragraph position="4"> Two other knowledge sources are used to define a project-specific format for the annotation scheme, in terms of its expressive power and its instantiation in XML: Data Category Specification (DCS): describes the set of data categories that can be used within a given annotation scheme, again using RDF schema. The DCS defines constraints on each category, including restrictions on the values it can take (e.g., &quot;text with markup&quot;, a &quot;picklist&quot; for grammatical gender, or any of the data types defined for XML) and restrictions on where a particular data category can appear (its level in the structural hierarchy). The DCS may include a subset of categories from the DCR together with application-specific categories additionally defined in the DCS. The DCS also indicates a level of granularity based on the DCR hierarchy.</Paragraph>
    <Paragraph position="5"> Dialect specification: defines, using XML schemas, XSLT scripts, and XSL style sheets, the project-specific XML format for syntactic annotations. The specifications may include: * Data category instantiation styles: Data categories may be realized in a project-specific scheme in any of a variety of formats. For example, if there exists a data category NounPhrase, this may be realized as an &lt;NounPhrase&gt; element (possibly containing additional elements), a typed element (e.g., &lt;cat type=&quot;NounPhrase&quot;&gt;), tag content (e.g., &lt;cat&gt;NounPhrase&lt;/cat&gt;), etc.</Paragraph>
    <Paragraph position="6"> * Data category vocabulary styles: Project-specific formats can use names different from those in the Data Category Registry; for instance, a DCR specification for NounPhrase can be expressed as NP or SN (syntagme nominal) in the project-specific format, if desired.</Paragraph>
    <Paragraph position="7"> * Expansion structures: A project-specific format may alter the structure of the annotation as expressed using the structural skeleton. For example, it may be desirable for processing or other reasons to create two sub-nodes under a given &lt;struct&gt; node, one to group features and one to group relations.</Paragraph>
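    <Paragraph> The instantiation and vocabulary styles listed above can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. The function instantiate, its style names ("element", "typed", "content"), and the vocabulary mapping are hypothetical; a real dialect specification would express the same choices with XML schemas and XSLT rather than Python.
<![CDATA[
```python
import xml.etree.ElementTree as ET


def instantiate(category, style, vocabulary=None):
    """Render one data category in a given (hypothetical) dialect style."""
    # Vocabulary style: optionally map the registry name to a project name.
    name = (vocabulary or {}).get(category, category)
    if style == "element":
        # The category becomes the element name itself.
        elem = ET.Element(name)
    elif style == "typed":
        # A generic element carries the category in a type attribute.
        elem = ET.Element("cat", {"type": name})
    elif style == "content":
        # A generic element carries the category as its text content.
        elem = ET.Element("cat")
        elem.text = name
    else:
        raise ValueError("unknown style: " + style)
    return ET.tostring(elem, encoding="unicode")
```
]]>
    For example, instantiate("NounPhrase", "content") produces a cat element whose text is NounPhrase, and passing the mapping {"NounPhrase": "SN"} swaps in the project-specific name without changing the style.</Paragraph>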
    <Paragraph position="8"> The combination of the structural skeleton and the DCS defines a virtual annotation markup language (AML). Any information structure that corresponds to a virtual AML has a canonical expression as an XML document; therefore, the inter-operability of different AMLs is dependent only on their compatibility at the virtual level. As such, virtual AML is the hub of the annotation framework: it defines a lingua franca for syntactic annotations that can be used to compare and merge annotations, as well as enable design of generic tools for visualization, editing, extraction, etc.</Paragraph>
    <Paragraph position="9"> The combination of a virtual AML with the Dialect Specification provides the information necessary to automatically generate a concrete AML representation of the annotation scheme, which conforms to the project-specific format provided in the Dialect Specification. XSLT filters translate between the representations of the annotation in concrete and virtual AML, as well as between non-XML formats (such as the LISP-like PTB notation) and concrete AML.6</Paragraph>
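    <Paragraph> As a rough illustration of translating a non-XML format into the nested representation, the following Python sketch converts a LISP-like bracketed tree into nested &lt;struct&gt; elements. It is a stand-in for the filters described above, not the authors' implementation: the function name ptb_to_struct is hypothetical, the input is assumed to be well-formed with at most one word per lowest-level node, and words are stored inline for readability where a stand-off encoding would store a pointer to the primary data.
<![CDATA[
```python
import re
import xml.etree.ElementTree as ET


def ptb_to_struct(text):
    """Convert a bracketed tree such as "(S (NP Paul) (VP left))"
    into nested struct elements (illustrative sketch only)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    root = None
    stack = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token after an opening bracket is the node label.
            node = ET.Element("struct", {"type": tokens[i + 1]})
            if stack:
                stack[-1].append(node)
            else:
                root = node
            stack.append(node)
            i += 2
        elif tok == ")":
            # A closing bracket ends the current node.
            stack.pop()
            i += 1
        else:
            # A bare token is a word; keep it as text on the open node.
            stack[-1].text = tok
            i += 1
    return root
```
]]>
    Serializing the result with ET.tostring(root, encoding="unicode") gives an XML document whose element nesting follows the bracket nesting of the input.</Paragraph>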
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Structural Skeleton
</SectionTitle>
      <Paragraph position="0"> For syntactic annotation, we can identify a general, underlying model that informs current practice: specification of constituency relations (with some set of application-specific names and properties) among syntactic or grammatical components (also with a set of application-specific names and properties), whether this is modeled with a tree structure or the relations are given explicitly.</Paragraph>
      <Paragraph position="1"> Because of the common use of trees in syntactic annotation, together with the natural tree-structure of markup in XML documents, we provide a structural skeleton for syntactic markup following this model. The most important element in the skeleton is the &lt;struct&gt; element, which represents a node (level) in the syntax tree. &lt;struct&gt; elements may be recursively nested at any level to reflect the structure of the corresponding tree. The &lt;struct&gt; element has the following attributes:</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Strictly speaking, an application-specific format could be
</SectionTitle>
    <Paragraph position="0"> translated directly into the virtual AML, eliminating the need for the intermediary concrete AML format. However, especially for existing formats, it is typically more straightforward to perform the two-step process.</Paragraph>
    <Paragraph position="1"> * type: specifies the node label (e.g., S, NP, etc.) or points to an object in another document that provides the value. This allows specifying complex data items as annotations. It also enables generating a single instantiation of an annotation value in a separate document that can be referenced as needed.</Paragraph>
    <Paragraph position="2"> * xlink: points to the data to which the annotation applies. In the XCES, we recommend the use of stand-off annotation, i.e., annotation that is maintained in a document separate from the primary (annotated) data (see footnote 7). The xlink attribute uses the XML Path Language (XPath) (Clark &amp; DeRose, 1999) to specify the location of the relevant data in the primary document.</Paragraph>
    <Paragraph position="3"> * ref: refers to a node defined elsewhere, used instead of xlink.</Paragraph>
    <Paragraph position="4"> * rel: specifies a type of relation (e.g., subj) * head: specifies the node corresponding to the head of the relation * dependent: specifies the node corresponding to the dependent of the relation * introducer: specifies the node corresponding to an introducing word or phrase * initial: gives a thematic or semantic role of a component, e.g., subj for the object of a by-phrase in a passive sentence.</Paragraph>
    <Paragraph position="5"> The hierarchy of &lt;struct&gt; elements corresponds to the nodes in a phrase structure analysis; each &lt;struct&gt; element is typed accordingly. The grammar underlying the annotation therefore specifies constraints on embedding that can be instantiated in an XML schema, which can then be used to prevent or detect tree structures that do not conform to the grammar. Conversely, the grammar rules implicit in annotated treebanks, which are typically not annotated according to a formal grammar, can be easily extracted from the abstract structural encoding.</Paragraph>
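    <Paragraph> Both directions mentioned here, extracting the rules implicit in an annotated tree and detecting trees that violate a grammar, can be sketched in Python over nested &lt;struct&gt; elements. The text proposes an XML schema for the checking step; the functions extract_rules and violations below are hypothetical substitutes that assume every &lt;struct&gt; carries a type attribute with the node label.
<![CDATA[
```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_rules(root):
    """Count rules of the form (parent type, child types),
    one for each struct element that has struct children."""
    rules = Counter()
    for node in root.iter("struct"):
        children = [c.get("type", "?") for c in node.findall("struct")]
        if children:
            rules[(node.get("type", "?"), tuple(children))] += 1
    return rules


def violations(root, grammar):
    """List the rules used in the tree that are absent from
    `grammar`, a set of (parent type, child types) pairs."""
    return sorted(rule for rule in extract_rules(root) if rule not in grammar)
```
]]>
    Running extract_rules over a whole treebank and summing the counters would give a frequency table of the rules implicit in the annotation; violations returns an empty list when a tree conforms to the supplied rule set.</Paragraph>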
    <Paragraph position="6"> The skeleton also includes a &lt;feat&gt; (feature) element, which can be used to provide additional information (e.g., gender, number) that is attached to the node in the tree represented by the enclosing &lt;struct&gt; element. Like &lt;struct&gt;, this element can be recursively nested or can point to a description in another document, thereby providing means to associate information at any level of detail or complexity to the annotated structure.</Paragraph>
    <Paragraph position="7"> 7 The stand-off scheme also provides means to represent ambiguities, since there can be multiple links between data and alternative annotations.</Paragraph>
    <Paragraph position="8"> Figure 4 shows the annotation from the PTB (Figure 1) rendered in the abstract XML format. Note that in this example, relations are encoded only when they appear explicitly in the original annotation (therefore, heads of relations default to unknown). An XSLT script could be used to create a second XML document that includes the relations implicit in the embedding (e.g., the first embedded &lt;struct&gt; with category NP has relation subject, the first VP is the head, etc.). A strict dependency annotation encoded in the abstract format uses a flat hierarchy and specifies all relations explicitly with the rel attribute, as shown in Figure 5 (see footnote 8).</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Using the XCES Scheme
</SectionTitle>
    <Paragraph position="0"> The Virtual AML provides a pivot format that enables comparison of annotations in different formats, including not only different constituency-based annotations, but also constituency-based and dependency annotations.</Paragraph>
    <Paragraph position="1"> For example, the PTB annotation corresponding to the dependency annotation in Figure 2 is shown in Figure 6. Figure 7 gives the corresponding encoding in the XCES abstract scheme. An XSLT script can readily extract the information in the dependency annotation (Figure 5) from the PTB encoding (Figure 7), producing a nearly identical dependency encoding. The script would use rules to make relations that are implicit in the structure of the PTB encoding explicit (for example, the xcomp relation that is implicit in the embedding of the S phrase).</Paragraph>
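    <Paragraph> One such rule, taken from the earlier example in which the first embedded NP is the subject and the first VP is the head, might look as follows. This is a Python sketch rather than XSLT, the function name add_subject_relations is hypothetical, and the attribute names id, type, rel and head and the label SBJ are assumed to match those used in the encoded figures.
<![CDATA[
```python
import xml.etree.ElementTree as ET


def add_subject_relations(root):
    """Mark the first NP child of each S node as the subject (SBJ)
    of the first VP child, unless a relation is already given."""
    for node in root.iter("struct"):
        if node.get("type") != "S":
            continue
        children = node.findall("struct")
        noun_phrase = next((c for c in children if c.get("type") == "NP"), None)
        verb_phrase = next((c for c in children if c.get("type") == "VP"), None)
        if noun_phrase is None or verb_phrase is None:
            continue
        if verb_phrase.get("id") is None:
            continue  # no identifier to point the head attribute at
        # Only fill in relations that the annotation left implicit.
        if noun_phrase.get("rel") is None:
            noun_phrase.set("rel", "SBJ")
            noun_phrase.set("head", verb_phrase.get("id"))
    return root
```
]]>
    Other implicit relations, such as the xcomp relation carried by an embedded S, would need further rules of the same shape; this sketch covers only the subject case.</Paragraph>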
    <Paragraph position="2"> The ability to generate a common representation for different annotations overcomes several obstacles that have hindered evaluation exercises in the past. For instance, the evaluation technique used in the PARSEVAL exercise is applicable to phrase structure analyses only, and cannot be applied to dependency-style analyses or lexical parsing frameworks such as finite-state constraint parsers. As the example above shows, this problem can be addressed using the XCES framework.</Paragraph>
    <Paragraph position="3"> 8 For the sake of readability, this encoding assumes that the sentence Paul intends to leave IBM is marked up as &lt;s1&gt;&lt;w1&gt;Paul&lt;/w1&gt;&lt;w2&gt;intends&lt;/w2&gt;&lt;w3&gt;to&lt;/w3&gt;&lt;w4&gt;leave&lt;/w4&gt;&lt;w5&gt;IBM&lt;/w5&gt;&lt;/s1&gt;.</Paragraph>
    <Paragraph position="4"> It has also been noted that the PARSEVAL bracket-precision measure penalizes parsers that return more structure than exists in the relatively flat treebank structures, even if they are correct (Srinivas, et al., 1995). XSLT scripts can extract the appropriate information for comparison purposes while retaining links to additional parts of the annotation in the original document, thus eliminating the need to &quot;dumb down&quot; parser output in order to participate in the evaluation exercise. Similarly, information lost in the transduction from phrase structure to a dependency-based analysis (as in the example above), which, as Atwell (1996) points out, may eliminate grammatical information potentially required for later processing, can also be retained.</Paragraph>
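    <Paragraph> The penalty on additional structure can be made concrete with a small Python sketch of bracket precision. The representation of a bracket as a (label, start, end) tuple and the function name bracket_precision are assumptions for illustration; the actual PARSEVAL procedure includes normalization steps that are not reproduced here.
<![CDATA[
```python
def bracket_precision(predicted, gold):
    """Fraction of predicted brackets that also occur in the gold
    standard. Brackets are (label, start, end) tuples (assumed form)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted.intersection(gold)) / len(predicted)


# A flat gold analysis and a parser output that adds two nested
# brackets; the spans are invented for illustration.
gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5)]
deeper = gold + [("S", 2, 5), ("VP", 3, 5)]
```
]]>
    With these invented spans, the flat analysis scores a precision of 1.0 against itself, while the deeper analysis scores 3 out of 5 even though it contains every gold bracket, which is the effect described above.</Paragraph>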
    <Paragraph position="5">
      &lt;struct id=&quot;s0&quot; type=&quot;S&quot;&gt;
      &lt;struct id=&quot;s1&quot; type=&quot;NP&quot; target=&quot;w1&quot; rel=&quot;SBJ&quot; head=&quot;s2&quot;/&gt;
      &lt;struct id=&quot;s2&quot; type=&quot;VP&quot; target=&quot;w2&quot;/&gt;
      &lt;struct id=&quot;s3&quot; type=&quot;S&quot;&gt;
      &lt;struct id=&quot;s4&quot; ref=&quot;s1&quot; rel=&quot;SBJ&quot; head=&quot;s6&quot;/&gt;
      &lt;struct id=&quot;s5&quot; type=&quot;VP&quot; target=&quot;w3&quot;&gt;
      &lt;struct id=&quot;s6&quot; type=&quot;VP&quot; target=&quot;w4&quot;&gt;
      &lt;struct id=&quot;s7&quot; type=&quot;NP&quot; target=&quot;w5&quot;/&gt;
    </Paragraph>
  </Section>
</Paper>