<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2713">
  <Title>Representing and Accessing Multilevel Linguistic Annotation using the MEANING Format</Title>
  <Section position="4" start_page="0" end_page="77" type="metho">
    <SectionTitle>
2 The MEANING Format
</SectionTitle>
    <Paragraph position="0"> Following the proposals for the ISO/TC 37/SC 4 standard for linguistic resources (Ide and Romary, 2002), the MAF scheme is based on annotation structures and data categories. Each type of annotation structure (nestable &lt;struct&gt; elements) corresponds to a specific kind of linguistic object (e.g. tokens, lexical units, multiwords), and each instance of a linguistic object is identified by a unique identifier. Data categories (&lt;feat&gt; tags) represent attributes of the linguistic objects. Different representation levels are contained in separate documents or document sections. XLink and XPointer syntax is used to represent relations between elements in different XML documents, and IDREF attributes are used for relations within the same document.</Paragraph>
    <Section position="1" start_page="0" end_page="77" type="sub_section">
      <SectionTitle>
2.1 First version
</SectionTitle>
      <Paragraph position="0"> The first version of the MEANING Format has been used to represent seven kinds of information: orthographic features, the structure of the text, morphosyntactic information, multiwords, syntactic information, named entities, and word senses.</Paragraph>
      <Paragraph position="1"> Annotation levels are related to each other through a hierarchy that reflects a theoretically grounded hierarchy of linguistic objects. The basic (orthographic) annotation level, representing tokens, is implemented with pointers to the character positions in the hub corpus. The morphosyntactic level, representing word-related morphological information, then contains pointers to the tokens, whereas the multiword level points to the words described at the morphosyntactic level.</Paragraph>
      <Paragraph position="2"> The following example shows how the morphosyntactic features of the Italian word &quot;andare&quot; (to go) are represented.</Paragraph>
      <Paragraph position="3">  of discontinuous units, such as non-contiguous multiwords; see &quot;andarci veramente piano&quot; (take it really easy). A detailed study of how standoff annotation allows for an elegant treatment of this phenomenon can be found in (Pianta and Bentivogli 2004).</Paragraph>
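      <Paragraph position="4"> The pointer hierarchy described in this section can be illustrated with a minimal Python sketch. It is not the MAF schema: the class names, field names and identifiers are hypothetical, the offsets are computed for the bare example string, and it assumes that the multiword in the example consists of "andarci" and "piano" with "veramente" intervening. <![CDATA[
```python
from dataclasses import dataclass

# Hypothetical in-memory model of three annotation levels; the names are
# illustrative and are not taken from the MAF schema.

@dataclass(frozen=True)
class Token:
    id: str
    start: int  # character offset in the hub corpus (inclusive)
    end: int    # character offset in the hub corpus (exclusive)

@dataclass(frozen=True)
class Word:
    id: str
    token_ids: tuple  # pointers to the orthographic level

@dataclass(frozen=True)
class Multiword:
    id: str
    word_ids: tuple  # pointers to the morphosyntactic level


def multiword_spans(mw, words, tokens):
    """Follow multiword -> word -> token pointers down to character spans."""
    spans = []
    for word_id in mw.word_ids:
        for token_id in words[word_id].token_ids:
            tok = tokens[token_id]
            spans.append((tok.start, tok.end))
    return sorted(spans)


hub = "andarci veramente piano"
tokens = {
    "t1": Token("t1", 0, 7),
    "t2": Token("t2", 8, 17),
    "t3": Token("t3", 18, 23),
}
words = {
    "w1": Word("w1", ("t1",)),
    "w2": Word("w2", ("t2",)),
    "w3": Word("w3", ("t3",)),
}
# Assumed analysis: the multiword skips the intervening adverb (w2).
mw = Multiword("mw1", ("w1", "w3"))

spans = multiword_spans(mw, words, tokens)
pieces = [hub[start:end] for start, end in spans]
```
]]> Because the multiword only lists the identifiers of its component words, the gap between them needs no special treatment: the two resulting character spans are simply non-adjacent.</Paragraph>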
    </Section>
    <Section position="2" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
2.2 Second version
</SectionTitle>
      <Paragraph position="0"> The first version of the MEANING Format has recently been extended within the FU-PAT ONTOTEXT project (Magnini et al. 2005).</Paragraph>
      <Paragraph position="1"> Within this project, we are creating the Italian Content Annotation Bank (I-CAB), a corpus of Italian news stories annotated with different kinds of semantic information. Annotation is being carried out manually, as we intend I-CAB to become a benchmark for automatic Information Extraction and Ontology Population tasks, including recognition and normalization of various types of entities, temporal expressions, relations between entities, and relations between entities and temporal expressions (e.g. the relation date-of-birth connecting a person to a date).</Paragraph>
      <Paragraph position="2"> To fulfill I-CAB annotation needs, we extended MAF by adding three new linguistic annotation levels: temporal expressions; entities of type person and organization; and mentions (i.e. the textual expressions referring to the entities). In keeping with the hierarchical approach to representing relations between annotation levels adopted in the first version of the MEANING Format, temporal expressions and entity mentions are represented with pointers to objects at the morphosyntactic level. Entities, instead, are represented with pointers to entity mentions.</Paragraph>
      <Paragraph position="3"> To manually annotate temporal expressions we followed the TIMEX2 markup standard, while to mark entities and mentions we relied on the ACE entity detection task guidelines. To perform the annotation task we used Callisto (http://callisto.mitre.org).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="77" end_page="78" type="metho">
    <SectionTitle>
3 Converting linguistic annotations into
MAF
</SectionTitle>
    <Paragraph position="0"> The manual annotations produced through Callisto, which relate to novel annotation levels such as temporal expressions and entity mentions, had to be integrated with more traditional annotations that are produced automatically with TextPro, a linguistic analysis tool suite developed at ITC-irst.</Paragraph>
    <Paragraph position="1">  As one can see in the above figure, two different annotation processes (automatic and manual) produce two different formats which must be converted and integrated into MAF in order to be accessed by the MEANING Browser (or any other NLP tool).</Paragraph>
    <Section position="1" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
3.1 From TextPro format to MEANING
Format
</SectionTitle>
      <Paragraph position="0"> TextPro takes a raw text as input and carries out basic processing tasks such as tokenization, morphological analysis, PoS tagging, lemmatization, and multiword recognition. The results of TextPro analyses are represented in a table, where each token is on a row, and columns contain multiple annotation levels. Converting from the TextPro to the MEANING Format requires retrieving the character positions of tokens in the hub corpus, which are not directly available in the TextPro output.</Paragraph>
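      <Paragraph position="1"> One way to recover the missing character positions is to scan the raw text from left to right and locate each token in turn. The Python sketch below illustrates the idea under stated assumptions and is not the converter actually used: the function name is hypothetical, the sentence is an invented example, and tokens are assumed to occur verbatim and in order in the raw text. <![CDATA[
```python
def align_tokens(raw_text, tokens):
    """Return (token, start, end) triples, with end exclusive.

    Assumes every token string occurs verbatim and in order in raw_text;
    this may not hold if the tokenizer normalizes characters.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        start = raw_text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        spans.append((tok, start, end))
        cursor = end  # never look behind the previous token
    return spans


# Invented example sentence; not taken from I-CAB.
raw = "Il governo ha rassegnato ieri le dimissioni."
toks = ["Il", "governo", "ha", "rassegnato", "ieri", "le", "dimissioni", "."]
spans = align_tokens(raw, toks)
```
]]> Advancing the cursor after each match keeps repeated word forms from being aligned to an earlier occurrence.</Paragraph>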
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
3.2 From AIF format to MEANING Format
</SectionTitle>
      <Paragraph position="0"> The Callisto manual annotation tool produces a coding format called AIF (Atlas Interchange Format), which implements a stand-off XML annotation scheme.</Paragraph>
      <Paragraph position="1"> When using the Callisto graphical interface, all annotations of temporal expressions and entity mentions are carried out by selecting a sequence of contiguous characters. As a consequence, all AIF annotations make reference to character positions.</Paragraph>
      <Paragraph position="2"> However, from Section 2.2 we know that in MAF temporal expressions and entity mentions make reference to morphosyntactic linguistic objects, not characters. This implies that, to go from AIF to the MEANING Format, we need to translate annotations making reference to the position of characters into annotations that point to morphological entities. More precisely, we need to substitute pointers to character positions with pointers to morphosyntactic objects which have been marked automatically by TextPro. Carrying out this step will also achieve the integration of manual and automatic annotations.</Paragraph>
      <Paragraph position="3"> The integration step is possible because the MAF hierarchy of annotation levels points, at the lowest level, to character positions. By following the hierarchy of links relating the various annotation levels it is always possible to trace back a linguistic object to some sequence of characters in the raw text, and in the opposite direction, given a string, we know what linguistic objects correspond to it. Summing up, the integration of AIF annotations into MAF requires that, given the character positions contained in the AIF annotation of some string, we substitute the pointers to characters with the pointers to the linguistic objects that cover the same string.</Paragraph>
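      <Paragraph position="4"> The substitution step summarized above can be sketched in Python as follows. This is an illustration under assumptions, not the actual conversion code: the function name, the word identifiers and the offsets are hypothetical, and the sketch simply rejects spans whose boundaries fall inside a word, since the text does not say how such cases are handled. <![CDATA[
```python
def words_covering(span_start, span_end, word_spans):
    """Return the ids of the words that cover the character span.

    word_spans: list of (word_id, start, end), sorted by start, end exclusive.
    Raises ValueError when a span boundary cuts through a word.
    """
    covered = []
    for word_id, start, end in word_spans:
        if end <= span_start or start >= span_end:
            continue  # no overlap with the annotated span
        if start < span_start or end > span_end:
            raise ValueError(
                f"span ({span_start}, {span_end}) splits word {word_id}"
            )
        covered.append(word_id)
    return covered


# Hypothetical word extents for "Il governo ha rassegnato ieri le dimissioni."
word_spans = [
    ("w1", 0, 2), ("w2", 3, 10), ("w3", 11, 13), ("w4", 14, 24),
    ("w5", 25, 29), ("w6", 30, 32), ("w7", 33, 43), ("w8", 43, 44),
]

timex_words = words_covering(25, 29, word_spans)    # characters of "ieri"
phrase_words = words_covering(30, 43, word_spans)   # "le dimissioni"
```
]]> The returned identifiers replace the character offsets in the converted annotation, so a manually marked expression ends up pointing to the same objects as the automatic annotations.</Paragraph>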
    </Section>
  </Section>
  <Section position="6" start_page="78" end_page="78" type="metho">
    <SectionTitle>
4 Data Access
</SectionTitle>
    <Paragraph position="0"> MAF turned out to be a flexible and expressive means to represent and integrate multiple levels of linguistic annotation. This was achieved mainly thanks to the adoption of the standoff annotation approach. However, accessing and retrieving information spread across possibly very large repositories (hundreds of thousands) of XML files may be a challenging task even for Database Management Systems specifically designed to handle XML. To solve this problem we first analyzed existing native XML databases such as eXist and Apache Xindice, but found that what was available at the time did not suit our needs. For this reason we approached the access problem through a two-fold strategy: (i) converting XML data into a relational database, and (ii) indexing XML data and accessing them through a search engine (LUCENE). The conversion of MAF data into a relational database works as follows. Each annotation level is mapped into a table, where rows represent instances of the relevant linguistic object (e.g. words), and columns represent its attributes (e.g. lemma, PoS, etc). Specific columns contain the object identifiers and the pointers to objects of other types/tables.</Paragraph>
    <Paragraph position="1"> Once MAF data are stored in a relational database, they can be accessed quite efficiently. However, when the access to data requires joins of many tables, access times become incompatible with various kinds of applications, such as on-line corpus browsing. For this reason we tried to complement the use of a relational database with the exploitation of the indexing capability of the LUCENE search engine (http://lucene.apache.org/). To this end we modified the LUCENE analyzer so as to be able to parse XML structures. In this way LUCENE can be configured to index any XML structure.</Paragraph>
    <Paragraph position="2"> The fast access capabilities of a relational database combined with the extended indexing capabilities of LUCENE enabled us to implement a browser of MAF annotated corpora.</Paragraph>
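    <Paragraph position="3"> The table-per-level mapping described in this section can be sketched with Python's built-in sqlite3 module. The table names, column names, identifiers and attribute values below are hypothetical and do not reproduce the actual database layout; the sketch only shows one table per annotation level, a separate table holding the pointers between levels, and the join needed to follow them. <![CDATA[
```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per annotation level; a link table stores cross-level pointers.
# All names are illustrative, not the actual schema.
conn.executescript("""
CREATE TABLE word (
    id    TEXT PRIMARY KEY,
    form  TEXT,
    lemma TEXT,
    pos   TEXT
);
CREATE TABLE mention (
    id          TEXT PRIMARY KEY,
    entity_type TEXT
);
CREATE TABLE mention_word (
    mention_id TEXT REFERENCES mention(id),
    word_id    TEXT REFERENCES word(id),
    position   INTEGER
);
""")

# Invented rows; lemma and PoS values are placeholders.
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", [
    ("w1", "Il", "il", "DET"),
    ("w2", "governo", "governo", "NOUN"),
    ("w3", "ha", "avere", "VERB"),
])
conn.execute("INSERT INTO mention VALUES (?, ?)", ("m1", "organization"))
conn.executemany("INSERT INTO mention_word VALUES (?, ?, ?)", [
    ("m1", "w1", 0),
    ("m1", "w2", 1),
])

# Following the pointers from a mention to its words requires a join.
rows = conn.execute(
    """
    SELECT w.form
    FROM mention_word AS mw
    JOIN word AS w ON w.id = mw.word_id
    WHERE mw.mention_id = ?
    ORDER BY mw.position
    """,
    ("m1",),
).fetchall()
mention_text = " ".join(form for (form,) in rows)
```
]]> Each additional annotation level that a query traverses adds another join of this kind, which is the cost that motivates complementing the database with an index.</Paragraph>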
  </Section>
  <Section position="7" start_page="78" end_page="79" type="metho">
    <SectionTitle>
5 The MEANING Browser
</SectionTitle>
    <Paragraph position="0"> The MEANING Browser can be used by humans to navigate any corpus encoded with MAF. The browser is built upon an API which can be used by any automatic system.</Paragraph>
    <Paragraph position="1"> In the following, we are going to demonstrate how I-CAB texts and their annotations can be accessed through the MEANING Browser.</Paragraph>
    <Paragraph position="2"> The first kind of access to the corpus is word-oriented, and amounts to a concordancer, i.e. a tool able to provide all the occurrences of a certain word in the corpus. The user can alternatively search for all occurrences of a word form, or a lemma, possibly constraining the search to a certain PoS. Free combinations between these constraints are allowed. The system will return a KWIC-like concordance of all the tokens in the corpus that match the request, within a chosen word window. By clicking on the magnifying glass, one can see the sentence in which the searched word occurs (see Appendix 1).</Paragraph>
    <Paragraph position="3"> By clicking on a specific icon a new window is opened where the whole text is displayed and its linguistic annotations are made accessible. A number of graphical widgets allow the user to highlight the desired annotations: e.g. nouns, verbs, multiwords, temporal expressions, mentions of a specific entity.</Paragraph>
    <Paragraph position="4"> In Appendix 2 the browser is used to show both nouns (automatically annotated) and entity mentions (from manual annotation). Appendix 3 shows time expressions and discontinuous multiwords; see how the multiword &quot;ha rassegnato ... le dimissioni&quot; (he resigned) is made discontinuous by the occurrence of a time expression ieri (yesterday). The browser will also give morphosyntactic information about single words composing multiwords (governo, government).</Paragraph>
    <Paragraph position="5"> From the same window one can access the XML files encoding multiple annotation levels for the same document.</Paragraph>
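    <Paragraph position="6"> The word-oriented access described in this section can be approximated by a small keyword-in-context routine. The Python sketch below is not the browser's API: the function name, the record layout, and the lemma and PoS values are hypothetical, and it works on an in-memory list rather than on the database and index. <![CDATA[
```python
def kwic(words, form=None, lemma=None, pos=None, window=2):
    """Return (left, keyword, right) for each word matching all constraints.

    words: list of dicts with 'form', 'lemma' and 'pos' keys (illustrative).
    Constraints left as None are ignored, so they can be combined freely.
    """
    wanted = {"form": form, "lemma": lemma, "pos": pos}
    wanted = {key: val for key, val in wanted.items() if val is not None}
    hits = []
    for i, word in enumerate(words):
        if all(word[key] == val for key, val in wanted.items()):
            left = " ".join(w["form"] for w in words[max(0, i - window):i])
            right = " ".join(w["form"] for w in words[i + 1:i + 1 + window])
            hits.append((left, word["form"], right))
    return hits


# Invented sentence; lemma and PoS values are placeholders, not TextPro output.
rows = [
    ("Il", "il", "DET"), ("governo", "governo", "NOUN"),
    ("ha", "avere", "VERB"), ("rassegnato", "rassegnare", "VERB"),
    ("ieri", "ieri", "ADV"), ("le", "il", "DET"),
    ("dimissioni", "dimissione", "NOUN"),
]
words = [{"form": f, "lemma": l, "pos": p} for f, l, p in rows]

by_lemma = kwic(words, lemma="avere", window=2)
nouns = kwic(words, pos="NOUN", window=1)
```
]]> Passing several constraints at once, for example a lemma together with a PoS, narrows the matches in the same way as the combined search described above.</Paragraph>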
  </Section>
</Paper>