<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2709">
  <Title>ANNIS: Complex Multilevel Annotations in a Linguistic Database</Title>
  <Section position="4" start_page="61" end_page="62" type="metho">
    <SectionTitle>
3 ANNIS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="61" end_page="61" type="sub_section">
      <SectionTitle>
3.1 Main Features
</SectionTitle>
      <Paragraph position="0"> ANNIS is a Java servlet application that can be accessed via standard web browsers. In its current state, it is not database-backed; data is read into memory, where it is queried and visualized (see footnote 3). Data format and interoperability. The data model must be sufficiently expressive to capture the data heterogeneity sketched above, including the representation of overlapping segments, intersecting hierarchies, and alternative annotations (e.g., for ambiguous annotations). It should further facilitate the addition of new annotations.</Paragraph>
      <Paragraph position="1"> In our approach, we use a flexible standoff XML format, the SFB-standard interchange format, as the interface format (Dipper, 2005). In this format, primary data is stored in a file that optionally specifies a header, followed by a tag &lt;body&gt;, which contains the source text. The format makes use of generic XML elements to encode data structures and annotations: &lt;mark&gt; (markable) tags specify text positions or spans of text (or spans of other markables) that can be annotated with linguistic information. Trees and graphs are encoded by &lt;struct&gt; (structure) and &lt;rel&gt; (relation) elements, which specify local subtrees. &lt;feat&gt; (feature) tags specify the information that is attached to markables or structures, which are referred to by xlink attributes. Each type of annotation is stored in a separate file; hence, competing or ambiguous annotations can be represented in a straightforward way, by distributing them over different files.</Paragraph>
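      <Paragraph><![CDATA[ The standoff idea described above can be illustrated with a minimal sketch. It is not the SFB-standard interchange format itself: the wrapper elements markList and featList, the attributes id, start, end, value, and type, and the plain target attribute (standing in for the xlink reference) are all assumptions made for this example, as is the sample sentence. Only the element names mark and feat are taken from the text.

```python
import xml.etree.ElementTree as ET

# Primary data: the source text that every annotation layer points into.
PRIMARY_TEXT = "Peter sleeps in the garden."

# Hypothetical markable file: spans are character offsets (end exclusive).
MARKABLES_XML = """
<markList>
  <mark id="tok1" start="0" end="5"/>
  <mark id="tok2" start="6" end="12"/>
  <mark id="np1" start="16" end="26"/>
</markList>
"""

# One hypothetical file per annotation type; competing annotations
# would simply live in further files of the same shape.
POS_XML = """
<featList type="pos">
  <feat target="#tok1" value="NE"/>
  <feat target="#tok2" value="VVFIN"/>
</featList>
"""

SYNTAX_XML = """
<featList type="cat">
  <feat target="#np1" value="NP"/>
</featList>
"""


def load_markables(xml_text):
    """Map each markable id to its (start, end) character span."""
    root = ET.fromstring(xml_text)
    return {
        m.get("id"): (int(m.get("start")), int(m.get("end")))
        for m in root.iter("mark")
    }


def resolve_layer(xml_text, markables, text):
    """Return (layer, value, covered text) for every feature in one layer."""
    root = ET.fromstring(xml_text)
    layer = root.get("type")
    rows = []
    for feat in root.iter("feat"):
        start, end = markables[feat.get("target").lstrip("#")]
        rows.append((layer, feat.get("value"), text[start:end]))
    return rows


marks = load_markables(MARKABLES_XML)
for layer_xml in (POS_XML, SYNTAX_XML):
    for row in resolve_layer(layer_xml, marks, PRIMARY_TEXT):
        print(row)
```

Because each layer only refers to markables by identifier, a new annotation type can be added as a further file without touching the primary text or the existing layers. ]]></Paragraph>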
      <Paragraph position="2"> Our format allows us to represent different kinds of annotations in a uniform way. We provide importers for the export formats of the annotation tools annotate, EXMARaLDA, RST Tool, and MMAX. Our PCC corpus (see Sec. 4) imports and synchronizes the following annotations, which have been created with these tools: syntax, information structure, rhetorical structure, and coreference.</Paragraph>
      <Paragraph position="3"> Footnote 3: For a more elaborate discussion of the basic concepts of ANNIS, see Dipper et al. (2004).</Paragraph>
      <Paragraph position="4"> Visualization. Suitable means for visualizing information are crucial for exploring and interpreting linguistic data. Due to the high degree of data heterogeneity, special attention has been paid to supporting the visualization of various data structures.</Paragraph>
      <Paragraph position="5"> In addition, annotations may refer to segments of different sizes, e.g. syntax vs. discourse structure. Furthermore, richness of information in multilevel annotations has to be taken into account; this requires a certain degree of user-adaptivity, allowing the user to modify the way information of interest is displayed.</Paragraph>
      <Paragraph position="6"> In ANNIS, we start from a basic interactive tier-based view, which allows for a compact simultaneous representation of many annotation types and whose appearance can be modified by the user in a format file. In addition, a discourse view helps the user to orient him/herself in the discourse. Further views can be added.</Paragraph>
      <Paragraph position="7"> Query support. Among the numerous requirements for a good query facility for multilevel annotation, expressiveness, efficiency, and user-friendly query formulation appear to be the most relevant. Even a very brief discussion of these issues would go beyond the limits of this paper; the reader is instead referred to Heid et al. (2004).</Paragraph>
      <Paragraph position="8"> Currently, ANNIS uses a query language prototype which allows the user to query text and annotations by means of regular expressions and wildcards, together with various common relational operators (e.g., for stating relations in tree structures, such as dominance or sibling relations). However, the set of operators for querying sequential relations is not sufficiently expressive, and querying coreference relations is not supported yet. Furthermore, user support for formulating queries is rather poor.</Paragraph>
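      <Paragraph><![CDATA[ As a rough illustration of how regular-expression matching on annotation values can be combined with a dominance operator, consider the following sketch. It does not reproduce the ANNIS query language or its implementation; the Node class, the function names, and the toy tree are assumptions made for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    feats: dict
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node properly dominated by `node`, in depth-first order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def all_nodes(root):
    """Yield the root followed by all of its descendants."""
    yield root
    yield from descendants(root)


def match(root, attr, pattern):
    """Nodes whose `attr` value fully matches the regular expression."""
    rx = re.compile(pattern)
    return [
        n for n in all_nodes(root)
        if attr in n.feats and rx.fullmatch(n.feats[attr])
    ]


def dominates(upper, lower):
    """All (a, b) pairs with a from `upper` dominating b from `lower`."""
    return [
        (a, b)
        for a in upper
        for b in descendants(a)
        if any(b is candidate for candidate in lower)
    ]


# Toy tree: S -> NP, VP ; VP -> PP ; PP -> NP
np1 = Node("np1", {"cat": "NP"})
np2 = Node("np2", {"cat": "NP"})
pp = Node("pp", {"cat": "PP"}, [np2])
vp = Node("vp", {"cat": "VP"}, [pp])
s = Node("s", {"cat": "S"}, [np1, vp])

# "A VP or PP that dominates an NP"
upper = match(s, "cat", "[VP]P")
lower = match(s, "cat", "NP")
for a, b in dominates(upper, lower):
    print(a.name, "dominates", b.name)
```

A sibling operator could be added along the same lines by comparing the children lists of a common parent. ]]></Paragraph>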
    </Section>
    <Section position="2" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
3.2 Open Issues
</SectionTitle>
      <Paragraph position="0"> Data alignment. Alignment of annotations created by different annotation tools appears to be most suitable at the level of tokens. However, tools often come with their own tokenizers, and mismatches occur frequently. We currently use a simple script that checks for text and token identity in the standoff files that we generate from the output of the individual tools. However, all mismatches have to be corrected manually. At least for white-space differences, an automatic fixing procedure should be feasible (similar to the one implemented by Witt et al. (2005)).</Paragraph>
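      <Paragraph><![CDATA[ The kind of check described above might look like the following sketch. It is not the script used by the authors; the function names, the three result labels, and the sample token lists are assumptions. It reports the first position at which two tokenizations diverge and separates cases where only segmentation or whitespace differs from cases where the underlying text differs.

```python
def first_mismatch(tokens_a, tokens_b):
    """Index of the first differing token, or None if the lists are equal."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One list is a strict prefix of the other.
        return min(len(tokens_a), len(tokens_b))
    return None


def strip_ws(tokens):
    """Concatenate all tokens with every whitespace character removed."""
    return "".join("".join(tok.split()) for tok in tokens)


def check_alignment(tokens_a, tokens_b):
    """Classify two tokenizations of what should be the same text."""
    idx = first_mismatch(tokens_a, tokens_b)
    if idx is None:
        return ("identical", None)
    if strip_ws(tokens_a) == strip_ws(tokens_b):
        # Same characters, different token boundaries or spacing:
        # a candidate for automatic repair.
        return ("segmentation-only", idx)
    return ("text-differs", idx)


# Hypothetical output of three tools for the same sentence.
TOOL_A = ["Peter", "sleeps", "in", "the", "garden", "."]
TOOL_B = ["Peter", "sleeps", "in", "the", "garden."]
TOOL_C = ["Peter", "sleep", "in", "the", "garden", "."]

print(check_alignment(TOOL_A, TOOL_B))
print(check_alignment(TOOL_A, TOOL_C))
```

Only the "text-differs" case would still need manual correction under this scheme. ]]></Paragraph>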
      <Paragraph position="1"> Efficient querying. Querying is currently restricted to rather small amounts of data, and complex queries may take some time to complete.</Paragraph>
      <Paragraph position="2"> Overlapping elements and intersecting hierarchies. The query language does not yet support convenient searching for overlapping elements.</Paragraph>
      <Paragraph position="3"> However, exactly what kinds of queries on overlapping segments or intersecting relations should be supported is an open question.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="62" end_page="63" type="metho">
    <SectionTitle>
4 Use Cases
</SectionTitle>
    <Paragraph position="0"> We illustrate the use of ANNIS in linguistic research, exemplified with research questions from three different linguistic areas.</Paragraph>
    <Paragraph position="1"> Historical investigations. The project B4: The role of information structure in the development of word order regularities in Germanic investigates the verb-second phenomenon, which occurred in certain Germanic languages only (e.g., it did in Modern German, but not in Modern English). One of their findings is that verb placement in the Old High German translation of Tatian correlates with discourse relations: verb-initial sentences usually occur in narrative contexts and signal continuation of the story. In contrast, verb-second sentences indicate subordinative relations (Hinterhölzl and Petrova, 2005).</Paragraph>
    <Paragraph position="2"> Typological studies. In the research project D2: Typology of Information Structure (cf., e.g., Götze et al., to appear), a typological questionnaire is designed, with which language data can be elicited using largely language-independent methods. Currently, data from 13 different languages is being elicited and annotated with information from various linguistic levels (morphosyntax, phonology, semantics, and information structure).</Paragraph>
    <Paragraph position="3"> An interesting query might look for nominal phrases (const=np) that are new in the discourse (given=new) and belong to the (information-) focus of a sentence (focus=ans), e.g. for investigating the phonological realization of these.</Paragraph>
    <Paragraph position="4"> The corresponding query has the form: const=np &amp; given=new &amp; focus=ans &amp; #1 = #2 (footnote 4). Queries in ANNIS can be restricted to subsets of a corpus by queries such as focus=ans &amp; doc=*81-11*, which searches for all answer foci in the data that has been elicited by means of task 81-11 in the questionnaire, yielding matching data from all languages in our database.</Paragraph>
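    <Paragraph><![CDATA[ To make the shape of such queries more concrete, here is a small sketch of how attribute-value constraints, an identity constraint between matched elements, and a document wildcard could be evaluated over standoff spans. It rests on two assumptions that the text does not spell out: that #n refers to the n-th constraint in the query, and that = between two such references requires both matches to cover the same span. The Anno type, the run_query function, and the sample data are hypothetical and do not reflect the ANNIS implementation.

```python
from fnmatch import fnmatchcase
from itertools import product
from typing import NamedTuple


class Anno(NamedTuple):
    doc: str    # document name
    start: int  # first token index
    end: int    # end token index (exclusive)
    attr: str
    value: str


def extent(anno):
    """The part of an annotation that must agree for an identity constraint."""
    return (anno.doc, anno.start, anno.end)


def run_query(annos, constraints, same_span=(), doc_glob="*"):
    """Evaluate a conjunction of attr=value constraints.

    constraints: list of (attr, value) pairs, numbered from 1 in query order.
    same_span:   pairs (i, j) meaning constraint i and j cover the same span.
    doc_glob:    shell-style wildcard restricting the documents searched.
    """
    pool = [a for a in annos if fnmatchcase(a.doc, doc_glob)]
    candidates = [
        [a for a in pool if (a.attr, a.value) == c] for c in constraints
    ]
    return [
        combo
        for combo in product(*candidates)
        if all(extent(combo[i - 1]) == extent(combo[j - 1]) for i, j in same_span)
    ]


# Hypothetical annotations from two elicitation documents.
ANNOS = [
    Anno("lang1-81-11", 0, 2, "const", "np"),
    Anno("lang1-81-11", 0, 2, "given", "new"),
    Anno("lang1-81-11", 0, 2, "focus", "ans"),
    Anno("lang1-81-11", 3, 5, "const", "np"),
    Anno("lang1-81-11", 3, 5, "given", "giv"),
    Anno("lang2-12-03", 0, 1, "focus", "ans"),
]

# const=np & given=new & #1 = #2
new_nps = run_query(ANNOS, [("const", "np"), ("given", "new")], same_span=[(1, 2)])

# focus=ans & doc=*81-11*
task_foci = run_query(ANNOS, [("focus", "ans")], doc_glob="*81-11*")

print(len(new_nps), len(task_foci))
```

The three-constraint query from the text can be passed in the same way by adding ("focus", "ans") to the constraint list. The exhaustive product over candidates is chosen for clarity only and would not scale to large corpora. ]]></Paragraph>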
    <Paragraph position="5"> Discourse studies. The Potsdam Commentary Corpus, PCC (Stede, 2004), consists of 173 newspaper commentaries, annotated for morphosyntax, coreference, discourse structure according to Rhetorical Structure Theory, and information structure.</Paragraph>
    <Paragraph position="6"> A question of interest here is the information-structural pattern of sentences introducing discourse segments that elaborate on another part of the discourse: elaboration &amp; rel=satellite &amp; (cat=vroot &amp; aboutness-topic) &amp; #1 &gt; #2 &amp; #2 = #3. Another research issue is the relationship of coreference and discourse structure. However, querying for coreference relations is not supported yet.</Paragraph>
  </Section>
  <Section position="6" start_page="63" end_page="63" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> Currently we are working on integrating a native XML database into our system. To make processing more efficient, we are developing an internal inline representation of the standoff interchange format, encoding overlapping segments by means of milestones or fragments (Barnard et al., 1995).</Paragraph>
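    <Paragraph><![CDATA[ The milestone technique mentioned above can be sketched as follows. This is an illustration of the general idea only, not the internal representation under development; the function name, the start and end marker elements with their id and ref attributes, and the sample spans are assumptions. Each span boundary becomes an empty marker element, so two overlapping spans can share one inline token sequence.

```python
def to_milestones(tokens, spans):
    """Serialize possibly overlapping spans as inline milestone markers.

    spans: (label, first_token, end_token) triples, end exclusive.
    """
    opens, closes = {}, {}
    for label, first, end in spans:
        opens.setdefault(first, []).append(label)
        closes.setdefault(end, []).append(label)

    out = []
    for i in range(len(tokens) + 1):
        # Close spans that end before token i, then open those starting at i.
        out.extend(f'<end ref="{label}"/>' for label in closes.get(i, []))
        out.extend(f'<start id="{label}"/>' for label in opens.get(i, []))
        if i < len(tokens):
            out.append(tokens[i])
    return " ".join(out)


TOKENS = ["Peter", "sleeps", "in", "the", "garden"]
# Span "a" covers tokens 0-2, span "b" covers tokens 2-4; they overlap on "in".
SPANS = [("a", 0, 3), ("b", 2, 5)]

print(to_milestones(TOKENS, SPANS))
```

Since the markers are empty elements, the serialized sequence can be wrapped in a single root element and remain well-formed even though the spans overlap, provided the tokens themselves are XML-escaped. ]]></Paragraph>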
    <Paragraph position="1"> Furthermore, the query language will be extended to cover different kinds of queries on sequential relations as well as coreference relations.</Paragraph>
    <Paragraph position="2"> Finally, we will add basic statistical means to the query facility, which, e.g., can point to rare and, hence, potentially interesting feature combinations.</Paragraph>
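    <Paragraph><![CDATA[ One simple way to surface rare feature combinations is to count how often each combination of attribute values occurs and to report those below a threshold. The sketch below assumes that annotated elements are available as plain dictionaries; the function name, the max_count threshold, and the sample records are hypothetical and are not part of the planned ANNIS functionality.

```python
from collections import Counter


def rare_combinations(records, attrs, max_count=1):
    """Value combinations of `attrs` that occur at most `max_count` times."""
    counts = Counter(
        tuple(record.get(attr, "-") for attr in attrs) for record in records
    )
    return sorted(combo for combo, n in counts.items() if n <= max_count)


# Hypothetical annotated phrases.
RECORDS = [
    {"const": "np", "given": "giv", "focus": "none"},
    {"const": "np", "given": "giv", "focus": "none"},
    {"const": "np", "given": "giv", "focus": "none"},
    {"const": "np", "given": "new", "focus": "ans"},
    {"const": "pp", "given": "new", "focus": "ans"},
]

print(rare_combinations(RECORDS, ("const", "given", "focus")))
```

A missing attribute is represented by "-" so that partially annotated elements still take part in the count. ]]></Paragraph>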
  </Section>
</Paper>