Multidimensional markup and heterogeneous linguistic resources
Maik St¨uhrenberg Andreas Witt Daniela Goecke
Faculty for Linguistics and Literature
Bielefeld University, Germany
{maik.stuehrenberg|andreas.witt|daniela.goecke|dieter.metzing|oliver.schonefeld}@uni-bielefeld.de
Dieter Metzing Oliver Schonefeld
Abstract
The paper discusses two topics: firstly an
approach of using multiple layers of an-
notation is sketched out. Regarding the
XML representation this approach is sim-
ilar to standoff annotation. A second topic
is the use of heterogeneous linguistic re-
sources (e.g., XML annotated documents,
taggers, lexical nets) as a source for semi-
automatic multi-dimensional markup to
resolve typical linguistic issues, dealing
with anaphora resolution as a case study.1
1 Introduction – Why (and how) to use
heterogeneous linguistic resources
A large and diverse amount of linguistic resources
(audio and video recodings, textual recordings)
has been piled up during various projects all over
the world. A reasonable subset of these resources
consists of machine-readable structured linguis-
tic documents (often XML annotated), dictionar-
ies, grammars or ontologies. Sometimes these
are available to the public on the Web, cf. Si-
mons (2004). The availability allows for the so-
phisticatedexaminationoflinguisticquestionsand
the reuse of existing linguistic material. Espe-
cially corpora annotated for discourse-related phe-
nomena have become an important source for var-
ious linguistic studies. Besides annotated corpora
external knowledge bases like lexical nets (e.g.,
WordNet, GermaNet) and grammars, can be used
to support several linguistic processes.
Although XML has recently established itself
as the technology and format of choice the be-
fore mentioned resources remain heterogeneous
1The work presented in this paper is part of the project A2
Sekimo of the Research Group Text-technological modelling
of information funded by the German Research Foundation.
in respect of the data format (i.e., the underlying
schema) or the functionality provided. A simple
approach to make use of different resources is to
usethemonebyone,startingwithrawtextdata(or
annotated XML) as input and providing the output
of the first process (e.g., a tagger) as input for the
next step (e.g., a parser). However, this method
may lead to several problems. One possible prob-
lem of this method is that the output format of one
processing resource can be unemployable as input
format for the next. Another potential problem of
using XML annotated documents is overlapping
annotation. And finally it is sometimes necessary
(or desirable) to process only parts of the input
document.
The structure of the paper is as follows: In Sec-
tion 2 our approach of representing multiple anno-
tations is desribed, in Section 3 the use of multi-
root trees for the representation of heterogeneous
ressources is presented. As a case study, the res-
olution of anaphoric relations is described in Sec-
tion 4.
2 Multiple annotations
Representing data corresponding to different lev-
els of annotation is a fundamental problem of text-
technological research. Renear et al. (1996) dis-
cuss the OHCO-Thesis2 as one of the basic as-
sumptionsaboutthestructureoftextandshowthat
thisassumptioncannotbeupheldconsistently. Be-
ing based on the OHCO-Thesis most markup lan-
guages (including SGML and XML) are designed
in principle to represent one structure per docu-
ment. Options to resolve this problem are dis-
cussed in Barnard et al. (1995) and several other
2The OHCO-Thesis, first presented in DeRose et al.
(1990) states ”that text is best represented as an ordered hier-
archy of content object (OHCO).”
85
proposals. To avoid drawbacks of the above men-
tioned approaches Witt (2002; 2004) discusses an
XML based solution which is used in our project.
2.1 Representation
We address the issue of overlapping markup by
using separate annotations of relevant phenomena
(e.g., syntactic information, POS, document struc-
ture) according to different document grammars,
i.e., the same textual data is encoded several times
(in separate files). One advantage of this multiple
annotation is that the modeling of information on
a level A is not dependent in any way on a level
B (in contrast to the standoff annotation model
described by Thompson and McKelvie (1997),
where a primary modeling level is needed). Addi-
tional annotation layers can be added easily with-
out any changes to the layers already established.
The primary data (i.e., the text which will be an-
notated) is separated from the markup and serves
as key element for establishing relations between
the annotation tiers. Witt et al. (2005) describe a
knowledge representation format (in the program-
minglanguageProlog3)whichcanbeusedtodraw
inferences between separate levels of annotations
and in which the parts of text (the PCDATA in
XML terms) are used as an absolute reference sys-
tem and as link between the levels of annotation
(cf. Bayerl et al. (2003)).
This representation format allows us to use var-
ious linguistic resources. A Python script con-
verts the different annotation layers (XML Docu-
ments) to the above mentioned Prolog representa-
tion which serves as input for the unification pro-
cess. The elements, attributes and text of all an-
notation layers are stored as Prolog predicates. As
a requirement all files must be identical with re-
spect to their underlying character data, what we
call identity of primary data.
2.2 Unification
Figure 1 shows the architecture used for the uni-
fication process. Different annotation layers are
unified, i.e., merged into one output fact base.
A script reconverts the Prolog representation into
one well-formed XML document containing the
annotations from the layers plus the textual con-
tent. In case of overlapping elements (which may
be the result of the unification), it converts those
3An additional representation format integrates the repre-
sentation of multi-rooted trees developed in the NITE-Project
(cf. Carletta et al. (2003)).
elements to milestones or fragments (cf. TEI
Guidelines (2004)) according to parameter op-
tions. A Java-based GUI is available to ease the
use of the above mentioned framework.
3 Multi-rooted trees
Based on the before mentioned architecture we
now focus on the usage of heterogeneous linguis-
tic resources (as described in in Section 1) in or-
der to semi-automatically add layers of markup to
various XML (or plain text) documents. We pre-
fer the term multi-rooted trees in favor of multi-
ple annotations, i.e., different layers of annotation
are stored in a single representation (based on the
above mentioned architecture). The input docu-
ment is separated into the textual information and
the annotation tree (if there is any). Both of these
are provided as input for linguistic resources. The
output of this process (typically an XML anno-
tated document) is again separated into text and
markup and serves as input for another resource.
Figure 2 gives an overview over the process.
4 Heterogeneous linguistic resources for
anaphora resolution: a case study
We use multi-rooted trees in order to annotate and
model coreferential phenomena and anaphoric re-
lations (cf. Sasaki et al (2002)). Base for the
resolution of anaphoric relations (both pronomi-
nal anaphora and definite description anaphora) is
a small test corpus containing German newspaper
articles and scientific articles in German and En-
glish. Figure 3 shows an excerpt of a German
newspaper article taken from “Die Zeit”. In this
example the first linguistic resource to apply is
a parser (in our case Machinese Syntax by Con-
nexor Oy). As a second step an XSLT script uses
the input document and the parser output and tags
discourse entities (see element de in Figure 3) by
judging several syntactic criteria provided by the
parser. The discourse entities mark the starting
point to determine anaphora-antecedent-relations
between pairs of discourse entities.
In order to resolve bridging relations (e.g.,
“door” as a meronym of “room”), WordNet (Ger-
maNet for German texts) is used as a linguistic
resource to establish relationships between dis-
course entities according to the information stored
in the synsets4. By now, we use an Open Source
4A synset represents a concept and consists of a set of one
or more synonyms.
86
Figure 1: Overview of the architecture
Figure 2: Using heterogeneous linguistic resources
1 <doc_article>
2 <doc_sect1>
3 <doc_title/>
4 <doc_para>
5 <chs_sentence id="s22">
6 <chs_de deLemma="rolf" deID="de_n_078" deType="nom">Marie Rolfs</chs_de> ist <
chs_de deLemma="jahr" deID="de_n_079" deType="nom">vier Jahre</chs_de> alt.
7 </chs_sentence>
8 <chs_sentence id="s23">
9 Mit <chs_de deLemma="petra" deID="de_n_080" deType="nom">ihrer
Zwillingsschwester Petra</chs_de>, <chs_de deLemma="bruder" deID="de_n_081"
deType="nom">ihrem achtj¨ahrigen Bruder</chs_de> und <chs_de deLemma="mutter"
deID="de_n_082" deType="nom">ihrer Mutter Sabine</chs_de> wohnt <chs_de deLemma
="sie" deID="de_n_083" deType="nom">sie</chs_de> in <chs_de deLemma="
dreizimmerwohnung" deID="de_n_084" deType="nom">einer Dreizimmerwohnung in <
chs_de deLemma="rahlstedt" deID="de_n_085" deType="nom">Rahlstedt a<chs_de
deLemma="stadt#rand" deID="de_n_086" deType="nom">m ¨ostlichen Hamburger
Stadtrand</chs_de> </chs_de> </chs_de>.
10 </chs_sentence>
11 <chs_semRel>
12 <chs_cospecLink relType="poss" phorIDRef="de_n_080" antecedentIDRefs="de_n_078"/>
13 <chs_cospecLink relType="poss" phorIDRef="de_n_081" antecedentIDRefs="de_n_078"/>
14 <chs_cospecLink relType="poss" phorIDRef="de_n_082" antecedentIDRefs="de_n_078"/>
15 <chs_cospecLink relType="ident" phorIDRef="de_n_083" antecedentIDRefs="de_n_078"/
>
16 </chs_semRel>
17 </doc_para>
18 </doc_sect1>
19 </doc_article>
Figure 3: An extract of a German newspaper article marked up by several linguistic resources
87
native XML database5 as test tool for querying the
GermaNet data6. Resolving synonymous or hy-
peronymous anaphoric relations is done by using
XPath or XQuery queries on pairs of discourse en-
tities. Bridging relations are harder to track down
and will be focused on in the near future.
Figure 3 shows the shortened and manually re-
vised output of the anaphora resolution. In this
example two annotation layers have been merged:
the logical document structure (in our case a mod-
dified version of DocBook, doc) and the level
of semantic relations (chs). The logical docu-
ment structure describes the organisation of the
text document in terms of chapters, sections, para-
graphs, and the like. The level of semantic re-
lations describes discourse entities and relations
between them. Corpus investigations give rise to
the supposition that the logical text structure in-
fluences the search scope of candidates for an-
tecedents. Anaphoric relations are annotated with
a cospecLink element (lines 12 to 15). The
attribute relType holds the type of relation be-
tween two discourse entities. Line 15 is an exam-
ple of an identity relation between discourse entity
de n 078 (“Marie Rolfs”, line 6) and discourse
entity de n 083 (“sie”, line 9) whereby the first
is marked as antecedent.
5 Conclusion
The architecture shown in this paper provides ac-
cess to multiple layers of linguistic annotation and
allows for the reuse and integration of existing lin-
guistic resources. The resulting additional annota-
tion layers are extremely useful for solving com-
plex linguistic issues like anaphora resolution. It
is our goal to enable a semi-automatic anaphora
resolution by the end of the project life-span.

References
Barnard, David, Burnard, Lou, Gaspart, Jean-Pierre
Gaspart, Price, LynneA., Spergerg-McQueen, C.M.
andGiovanniBattistaVarile. 1995. Hierarchicalen-
coding of text: Technical problems and SGML solu-
tions. Computers and the Humanities, 29(3):211–
231.
Bayerl, Petra Saskia, L¨ungen, Harald, Goecke,
Daniela, Witt, Andreas and Daniel Naber. 2003.
5eXist (http://www.exist-db.org)
6GermaNet is available as an XML representation which
can be stored (and queried) in the above mentioned (and ad-
ditional) native XML database.
Methods for the semantic analysis of document
markup. In Proceedings of the 2003 ACM sym-
posium on Document engineering, pages 161–170,
New York, NY, USA. ACM Press.
Carletta, Jean Kilgour, Jonathan, O’Donnel, Timothy
J., Evert, Stefan and Holger Voormann. 2003. The
NITEObjectModelLibraryforHandlingStructured
Linguistic Annotation on Multimodal Data Sets. In
Proceedings of the EACL Workshop on Language
Technology and the Semantic Web (3rd Workshop on
NLP and XML (NLPXML-2003)), Budapest, Hun-
gary.
DeRose, Steven J., Durand, David G., Mylonas, Elli
and Allen H. Renear. 1990. What is text, really?
Journal of Computing in Higher Education, 1(2):3–
26.
Renear, Allen, Mylonas, Elli and David Durand. 1996.
Refining our notion of what text really is: The prob-
lem of overlapping hierarchies. Research in Hu-
manities Computing. Selected Papers from the ALL-
C/ACH Conference, Christ Church, Oxford, April
1992, 4:263–280.
Sasaki, Felix, Wegener, Claudia, Metzing, Dieter Met-
zing and Jens P¨onninghaus. 2002. Co-reference
annotation and resources: A multilingual corpus of
typologically diverse languages. In Proceedings of
the 3nd International Conference on Language Re-
sources and Evaluation (LREC 2002), pages 1225–
1230, Las Palmas.
Simons, Gary, Lewis, William, Farrar, Scott, Langen-
doen, Terence, Fitzsimons, Brian and Hector Gon-
zalez. 2004. The Semantics of Markup: Map-
pingLegacyMarkupSchemastoaCommonSeman-
tics. In Graham Wilcock, Nancy Ide and Laurent
Romary, editor, Proceedings of the 4th Workshop
on NLP and XML (NLPXML-2004), pages 25–32,
Barcelona, Spain, July. Association for Computa-
tional Linguistics.
Sperberg-McQueen, C. M. and Lou Burnard, editor.
2004. Guidelines for Text Encoding and Inter-
change. published for the TEI Consortium by Hu-
manities Computing Unit, University of Oxford.
Thompson, Henry S. and David McKelvie. 1997. Hy-
perlink semantics for standoff markup of read-only
documents. In Proceedings of SGML Europe ’97:
The next decade – Pushing the Envelope, pages227–
229, Barcelona.
Witt, Andreas, Goecke, Daniela, Sasaki, Felix and Har-
ald L¨ungen. 2005. Unification of XML Documents
with Concurrent Markup. Literary and Lingustic
Computing, 20(1):103–116.
Witt, Andreas. 2002. Meaning and interpretation of
concurrent markup. In ALLCACH2002, Joint Con-
ference of the ALLC and ACH, T¨ubingen.
Witt, Andreas. 2004. Multiple hierarchies: new as-
pects of an old solution. In Proceedings of Extreme
Markup Languages.
