<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2709">
  <Title>Interlingual Annotation of Multilingual Text Corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus
</SectionTitle>
    <Paragraph position="0"> The target data set is modeled on, and is an extension of, the DARPA MT Evaluation data set (White and O'Connell 1994) and includes data from the Linguistic Data Consortium (LDC) Multiple Translation Arabic, Part 1 (Walker et al., 2003). The data set consists of 6 bilingual parallel corpora. Each corpus is made up of 125 source language news articles along with three translations into English, each produced independently by different human translators. However, the source news articles for each individual language corpus are different from the source articles in the other language corpora. Thus, the 6 corpora themselves are comparable to each other rather than parallel. The source languages are Japanese, Korean, Hindi, Arabic, French and Spanish. Typically, each article is between 300 and 400 words long (or the equivalent) and thus each corpus has between 150,000 and 200,000 words. Consequently, the size of the entire data set is around 1,000,000 words.</Paragraph>
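The per-corpus and total word counts follow directly from the corpus design described above; a quick sketch of the arithmetic:

```python
# Rough sanity check of the corpus sizes stated above.
articles_per_corpus = 125
versions_per_article = 4          # 1 source article + 3 independent English translations
words_per_article = (300, 400)    # typical article length range (or the equivalent)

words_per_corpus = tuple(w * articles_per_corpus * versions_per_article
                         for w in words_per_article)
print(words_per_corpus)           # (150000, 200000)

num_corpora = 6                   # Japanese, Korean, Hindi, Arabic, French, Spanish
total_words = tuple(w * num_corpora for w in words_per_corpus)
print(total_words)                # (900000, 1200000), i.e. around 1,000,000 words
```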
    <Paragraph position="1"> Thus, for any given corpus, the annotation effort is to assign interlingual content to a set of 4 parallel texts, 3 of which are in the same language, English, and all of which theoretically communicate the same information.</Paragraph>
    <Paragraph position="2"> The following is an example set from the Spanish corpus: null S: Atribuyó esto en gran parte a una política que durante muchos años tuvo un &amp;quot;sesgo concentrador&amp;quot; y representó desventajas para las clases menos favorecidas.</Paragraph>
    <Paragraph position="3"> T1: He attributed this in great part to a type of politics that throughout many years possessed a &amp;quot;concentrated bias&amp;quot; and represented disadvantages for the less favored classes.</Paragraph>
    <Paragraph position="4"> T2: To a large extent, he attributed that fact to a policy which had for many years had a &amp;quot;bias toward concentration&amp;quot; and represented disadvantages for the less favored classes.</Paragraph>
    <Paragraph position="5"> T3: He attributed this in great part to a policy that had a &amp;quot;centrist slant&amp;quot; for many years and represented disadvantages for the less-favored classes.</Paragraph>
    <Paragraph position="6"> The annotation process involves identifying the variations between the translations and then assessing whether these differences are significant. In this case, the translations are, for the most part, the same although there are a few interesting variations.</Paragraph>
    <Paragraph position="7"> For instance, where this appears as the translation of esto in the first and third translations, that fact appears in the second. The translator choice potentially represents an elaboration of the semantic content of the source expression and the question arises as to whether the annotation of the variation in expressions should be different or the same.</Paragraph>
    <Paragraph position="8"> More striking perhaps is the variation between concentrated bias, bias toward concentration and centrist slant as the translation for sesgo concentrador. Here, the third translation offers a clear interpretation of the source text author's intent. The first two attempt to carry over the vagueness of the source expression, assuming that the target text reader will be able to figure it out. But even here, the two translators appear to differ as to what the source language text author's intent actually was, the former referring to a bias of a certain degree of strength and the latter to a bias in a certain direction. Seemingly, then, the annotation of each of these expressions should differ.</Paragraph>
    <Paragraph position="9"> Furthermore, each source language has different methods of encoding meaning linguistically. The resultant differing types of translation mismatch with English should provide insight into the appropriate structure and content for an interlingual representation.</Paragraph>
    <Paragraph position="10"> The point is that a multilingual parallel data set of source language texts and English translations offers a unique perspective and unique problem for annotating texts for meaning.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Interlingua
</SectionTitle>
    <Paragraph position="0"> Due to the complexity of interlingual annotation, as indicated by the differences described in the previous section, the representation has been developed through three levels and incorporates knowledge from sources such as the Omega ontology and theta grids. Since this is an evolving standard, the three levels will be presented in order, each building on the previous one. Then the additional data components will be described.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Three Levels of Representation
</SectionTitle>
      <Paragraph position="0"> We now describe three levels of representation, referred to as IL0, IL1 and IL2. The aim is to perform the annotation process incrementally, with each level of representation incorporating additional semantic features and removing existing syntactic ones. IL2 is intended as the interlingua, which abstracts away from (most) syntactic idiosyncrasies of the source language. IL0 and IL1 are intermediate representations that are useful starting points for annotating at the next level.</Paragraph>
      <Paragraph position="1">  IL0 is a deep syntactic dependency representation. It includes part-of-speech tags for words and a parse tree that makes explicit the syntactic predicate-argument structure of verbs. The parse tree is labeled with syntactic categories such as Subject or Object, which refer to deep-syntactic grammatical function (normalized for voice alternations). IL0 does not contain function words (determiners, auxiliaries, and the like): their contribution is represented as features. Furthermore, semantically void punctuation has been removed. While this representation is purely syntactic, many disambiguation decisions, such as relative clause and PP attachment, have already been made, and the representation abstracts as much as possible from surface-syntactic phenomena.</Paragraph>
      <Paragraph position="2"> Thus, our IL0 is intermediate between the analytical and tectogrammatical levels of the Prague School (Hajic et al 2001). IL0 is constructed by hand-correcting the output of a dependency parser (details in section 6) and is a useful starting point for semantic annotation at IL1, since it allows annotators to see how textual units relate syntactically when making semantic judgments.</Paragraph>
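As a concrete illustration of the IL0 structure described above, a deep dependency node might be sketched as follows. The field names and serialization here are our own assumptions for illustration; the project does not prescribe this format:

```python
# A minimal sketch of an IL0-style node (hypothetical field names).
# Function words such as determiners and auxiliaries appear only as
# features on content words, not as nodes of their own.
from dataclasses import dataclass, field

@dataclass
class IL0Node:
    lemma: str        # citation form of the content word
    pos: str          # part-of-speech tag
    deprel: str       # deep-syntactic relation, e.g. "Subject", "Object"
    feats: dict = field(default_factory=dict)   # e.g. {"tense": "past"}
    children: list = field(default_factory=list)

# "The dog was chased": the passive is normalized, so "dog" is the
# deep Object despite appearing as the surface subject, and "the"/"was"
# survive only as features.
tree = IL0Node("chase", "V", "Root",
               {"tense": "past", "voice": "passive"},
               children=[IL0Node("dog", "N", "Object", {"definite": True})])
assert tree.children[0].deprel == "Object"
```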
      <Paragraph position="3">  IL1 is an intermediate semantic representation. It associates semantic concepts with lexical units like nouns, adjectives, adverbs and verbs (details of the ontology in section 4.2). It also replaces the syntactic relations in IL0, like subject and object, with thematic roles, like agent, theme and goal (details in section 4.3). Thus, like PropBank (Kingsbury et al 2002), IL1 neutralizes different alternations for argument realization. However, IL1 is not an interlingua; it does not normalize over all linguistic realizations of the same semantics. In particular, it does not address how the meanings of individual lexical units combine to form the meaning of a phrase or clause. It also does not address idioms, metaphors and other non-literal uses of language. Further, IL1 does not assign semantic features to prepositions; these continue to be encoded as syntactic heads of their phrases, although these might have been annotated with thematic roles such as location or time.</Paragraph>
      <Paragraph position="4">  IL2 is intended to be an interlingua, a representation of meaning that is reasonably independent of language. IL2 is intended to capture similarities in meaning across languages and across different lexical/syntactic realizations within a language. For example, IL2 is expected to normalize over conversives (e.g. X bought a book from Y vs. Y sold a book to X) (as does FrameNet (Baker et al 1998)) and non-literal language usage (e.g. X started its business vs. X opened its doors to customers). The exact definition of IL2 will be the major research contribution of this project.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The Omega Ontology
</SectionTitle>
      <Paragraph position="0"> In progressing from IL0 to IL1, annotators have to select semantic terms (concepts) to represent the nouns, verbs, adjectives, and adverbs present in each sentence.</Paragraph>
      <Paragraph position="1"> These terms are represented in the 110,000-node ontology Omega (Philpot et al., 2003), under construction at ISI. Omega has been built semi-automatically from a variety of sources, including Princeton's WordNet (Fellbaum, 1998), NMSU's Mikrokosmos (Mahesh and Nirenburg, 1995), ISI's Upper Model (Bateman et al., 1989) and ISI's SENSUS (Knight and Luk, 1994). After the uppermost region of Omega was created by hand, these various resources' contents were incorporated and, to some extent, reconciled. After that, several million instances of people, locations, and other facts were added (Fleischman et al., 2003). The ontology, which has been used in several projects in recent years (Hovy et al., 2001), can be browsed using the DINO browser at http://blombos.isi.edu:8000/dino; this browser forms a part of the annotation environment. Omega remains under continued development and extension.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 The Theta Grids
</SectionTitle>
      <Paragraph position="0"> Each verb in Omega is assigned one or more theta grids specifying the arguments associated with the verb and their theta roles (or thematic roles). Theta roles are abstractions of deep semantic relations that generalize over verb classes. They are by far the most common approach in the field to representing predicate-argument structure. However, there are numerous variations with little agreement even on terminology (Fillmore, 1968; Stowell, 1981; Jackendoff, 1972; Levin and Rappaport-Hovav, 1998).</Paragraph>
      <Paragraph position="1"> The theta grids used in our project were extracted from the Lexical Conceptual Structure Verb Database (LVD) (Dorr, 2001). The WordNet senses assigned to each entry in the LVD were then used to link the theta grids to the verbs in the Omega ontology. In addition to the theta roles, the theta grids specify the mapping between theta roles and their syntactic realization in arguments, such as Subject, Object or Prepositional Phrase, and the Obligatory/Optional nature of the argument, thus facilitating IL1 annotation. For example, one of the theta grids for the verb &amp;quot;load&amp;quot; is listed in Table 1 (at the end of the paper).</Paragraph>
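The structure of such a theta grid, mapping each theta role to its syntactic realization and obligatory/optional status, can be sketched as below. The actual grid for &amp;quot;load&amp;quot; appears in Table 1; the specific roles and realizations here are assumptions for illustration only:

```python
# Illustrative sketch of a theta-grid entry in the style described above.
# The real grid for "load" is given in Table 1 of the paper; this one is
# hypothetical.
theta_grid_load = {
    "verb": "load",
    "roles": [
        {"theta_role": "agent", "realization": "Subject",              "status": "obligatory"},
        {"theta_role": "theme", "realization": "Object",               "status": "obligatory"},
        {"theta_role": "goal",  "realization": "Prepositional Phrase", "status": "optional"},
    ],
}

# During IL1 annotation, the grid tells the annotator which dependents of
# the verb must receive a role and how each role is typically realized.
obligatory = [r["theta_role"] for r in theta_grid_load["roles"]
              if r["status"] == "obligatory"]
print(obligatory)  # ['agent', 'theme']
```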
      <Paragraph position="2"> Although based on research in LCS-based MT (Dorr, 1993; Habash et al, 2002), the set of theta roles used has been simplified for this project. This list (see Table 2 at the end of the paper), was used in the Inter-</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Incremental Annotation
</SectionTitle>
      <Paragraph position="0"> As described earlier, the development and annotation of the interlingual notation is incremental in nature. This necessitates constraining the types and categories of attributes included in the annotation during the beginning phases. Other topics not addressed here but considered for future work include time, aspect, location, modality, type of reference, types of speech act, and causality.</Paragraph>
      <Paragraph position="1"> Thus, IL2 itself is not a final interlingual representation, but one step along the way. IL0 and IL1 are also intermediate representations, and as such are an occasionally awkward mixture of syntactic and semantic information. The decisions as to what to annotate, what to normalize, and what to represent as features at each level are semantically and syntactically principled, but also governed by expectations about reasonable annotator tasks. What is important is that at each stage of transformation no information is lost, and the original text is recoverable in principle from the representation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="2" type="metho">
    <SectionTitle>
5 Annotation Tool
</SectionTitle>
    <Paragraph position="0"> We have assembled a suite of tools to be used in the annotation process. Some of these tools are previously existing resources that were gathered for use in the project, and others have been developed specifically with the annotation goals of this project in mind. Since we are gathering our corpora from disparate sources, we need to standardize the text before presenting it to automated procedures. For English, this involves sentence boundary detection, but for other languages, it may involve segmentation, chunking of text, or other &amp;quot;text ecology&amp;quot; operations. The text is then processed with a dependency parser, the output of which is viewed and corrected in TrED (Hajic, et al., 2001), a graphically-based tree editing program, written in Perl/Tk.</Paragraph>
    <Paragraph position="1"> The revised deep dependency structure produced by this process is the IL0 representation for that sentence.</Paragraph>
    <Paragraph position="2"> In order to derive IL1 from the IL0 representation, annotators use Tiamat, a tool developed specifically for this project. This tool enables viewing of the IL0 tree with easy reference to all of the IL resources described in section 4 (the current IL representation, the ontology, and the theta grids). This tool provides the ability to annotate text via simple point-and-click selections of words, concepts, and theta roles. The IL0 is displayed in the top left pane, ontological concepts and their associated theta grids, if applicable, are located in the top right, and the sentence itself is located in the bottom right pane. An annotator may select a lexical item (leaf node) to be annotated in the sentence view; this word is highlighted, and the relevant portion of the Omega ontology is displayed in the pane on the left. In addition, if this word has dependents, they are automatically underlined in red in the sentence view. Annotators can view all information pertinent to the process of deciding on appropriate ontological concepts in this view. Following the procedures described in section 6, selection of concepts, theta grids and roles appropriate to that lexical item can then be made in the appropriate panes.</Paragraph>
    <Paragraph position="3"> Evaluation of the annotators' output would be daunting based solely on a visual inspection of the annotated IL1 files. Thus, a tool was also developed to compare the output and to generate the evaluation measures that are described in section 7. The reports generated by the evaluation tool allow the researchers to look at both gross-level phenomena, such as inter-annotator agreement, and at more detailed points of interest, such as lexical items on which agreement was particularly low, possibly indicating gaps or other inconsistencies in the ontology being used.</Paragraph>
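One gross-level measure such an evaluation tool might report is agreement between the concept labels two annotators assigned to the same lexical items. The paper's own measures are described in section 7; as an illustrative sketch, Cohen's kappa (chance-corrected agreement) can be computed as follows:

```python
# Minimal sketch of Cohen's kappa over two annotators' concept labels for
# the same sequence of lexical items. This is an illustration of the kind
# of inter-annotator agreement measure such a tool could report, not the
# project's actual metric.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label
    # distribution (Counter returns 0 for labels the other never used).
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical concept labels from two annotators on five words:
a = ["DOG", "RUN", "FAST", "DOG", "EAT"]
b = ["DOG", "RUN", "SLOW", "CAT", "EAT"]
print(round(cohen_kappa(a, b), 3))  # 0.524
```

Per-item reports of where kappa is dragged down point at exactly the low-agreement lexical items the text mentions, which may indicate gaps in the ontology.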
  </Section>
  <Section position="7" start_page="2" end_page="3" type="metho">
    <SectionTitle>
6 Annotation Task
</SectionTitle>
    <Paragraph position="0"> To describe the annotation task, we first present the annotation process and the tools used with it, followed by the annotation manuals. Finally, setup issues relating to coordinating multi-site annotation are discussed.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
6.1 Annotation process
</SectionTitle>
      <Paragraph position="0"> The annotation process was identical for each text. For the initial testing period, only English texts were annotated, and the process described here is for English text.</Paragraph>
      <Paragraph position="1"> The process for non-English texts will be, mutatis mutandis, the same.</Paragraph>
      <Paragraph position="2"> Each sentence of the text is parsed into a dependency tree structure. For English texts, these trees were first provided by the Connexor parser at UMIACS (Tapanainen and Jarvinen, 1997), and then corrected by one of the team PIs. For the initial testing period, annotators were not permitted to alter these structures. Already at this stage, some of the lexical items are replaced by features (e.g., tense), morphological forms are replaced by features on the citation form, and certain constructions are regularized (e.g., passive) and empty arguments inserted. It is this dependency structure that is loaded into the annotation tool and which each annotator then marks up.</Paragraph>
      <Paragraph position="3"> The annotator was instructed to annotate all nouns, verbs, adjectives, and adverbs. This involves annotating each word twice: once with a WordNet synset and once with a Mikrokosmos concept; these two units of information are merged, or at least intertwined, in Omega. One of the goals and results of this annotation process will be a simultaneous coding of concepts in both ontologies, facilitating a closer union between them.</Paragraph>
      <Paragraph position="4"> In addition, users were instructed to provide a semantic case role for each dependent of a verb. In many cases this was &amp;quot;NONE&amp;quot; since adverbs and conjunctions were dependents of verbs in the dependency tree. LCS verbs were identified with WordNet classes and the LCS case frames were supplied where possible. The user, however, was often required to determine the set of roles or alter them to suit the text. In both cases, the revised or new set of case roles was noted and sent to a guru for evaluation and possible permanent inclusion. Thus the set of event concepts in the ontology that are supplied with roles will grow over the course of the project.</Paragraph>
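The per-word record implied by the two paragraphs above, a dual WordNet/Mikrokosmos concept assignment plus case roles on a verb's dependents, might look like the following. The field names and identifiers are assumptions, not the project's actual schema:

```python
# Hypothetical sketch of a single verb's annotation record: one WordNet
# synset, one Mikrokosmos concept, and a case role for every dependent.
# All identifiers below are illustrative assumptions.
annotation = {
    "word": "load",
    "wordnet_synset": "load.v.01",        # illustrative synset identifier
    "mikrokosmos_concept": "LOAD-EVENT",  # illustrative concept name
    "dependents": [
        {"word": "workers", "theta_role": "agent"},
        {"word": "quickly", "theta_role": "NONE"},  # adverb dependents get NONE
    ],
}

# Every dependent of the verb must carry a role (possibly NONE):
assert all("theta_role" in d for d in annotation["dependents"])
```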
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
6.2 The annotation manuals
</SectionTitle>
      <Paragraph position="0"> Markup instructions are contained in three manuals: a user's guide for Tiamat (including procedural instructions), a definitional guide to semantic roles, and a manual for creating a dependency structure (IL0). Together these manuals allow the annotator to (1) understand the intention behind aspects of the dependency structure; (2) use Tiamat to mark up texts; and (3) determine appropriate semantic roles and ontological concepts. In choosing a set of appropriate ontological concepts, annotators were encouraged to look at the name of the concept and its definition, the name and definition of the parent node, example sentences, lexical synonyms attached to the same node, and sub- and super-classes of the node. All these manuals are available on the IAMTC website (http://sparky.umiacs.umd.edu:8000/IAMTC/annotation_manual.wiki?cmd=get&amp;anchor=Annotation+Manual).</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
6.3 The multi-site set up
</SectionTitle>
      <Paragraph position="0"> For the initial testing phase of the project, all annotators at all sites worked on the same texts. Two texts were provided by each site, as were two translations of the same source language (non-English) text. To test for the effects of coding two texts that are semantically close, since they are both translations of the same source document, the order in which the texts were annotated differed from site to site, with half the sites marking one translation first, and the other half of the sites marking the second translation first. Another variant tested was to interleave the two translations, so that two similar sentences were coded consecutively.</Paragraph>
      <Paragraph position="1"> During the later production phase, a more complex schedule will be followed, making sure that many texts are annotated by two annotators, often from different sites, and that regularly all annotators will mark the same text. This will help ensure continued inter-coder reliability.</Paragraph>
      <Paragraph position="2"> In the period leading up to the initial test phase, weekly conversations were held at each site by the annotators, going over the texts coded. This was followed by a weekly conference call among all the annotators.</Paragraph>
      <Paragraph position="3"> During the test phase, no discussion was permitted.</Paragraph>
      <Paragraph position="4"> One of the issues that arose in discussion was how certain constructions should be displayed and whether each word should have a separate node or whether certain words should be combined into a single node. In view of the fact that the goal was not to tag individual words, but entities and relations, in many cases words were combined into single nodes to facilitate this process. For instance, verb-particle constructions were combined into a single node. In a sentence like &amp;quot;He threw it up&amp;quot;, &amp;quot;throw&amp;quot; and &amp;quot;up&amp;quot; were combined into a single node &amp;quot;throw up&amp;quot; since one action is described by the combined words. Similarly, proper nouns, compound nouns and copular constructions required specialized handling. In addition, issues arose about whether annotators should be permitted to change dependency trees, and about how best to instruct annotators in determining an appropriate ontology node.</Paragraph>
    </Section>
  </Section>
</Paper>