Layering and Merging Linguistic Anotations 
 
Keith Suderman 
Department of Computer Science 
Vasar College 
Poughkeepsie, NY USA 
suderman@cs.vassar.edu 
Nancy Ide 
Department of Computer Science 
Vasar College 
Poughkeepsie, NY USA 
ide@cs.vassar.edu 
 
 
Abstract 
The American National Corpus and its 
annotations are represented in a stand-off 
XML format compliant with the specifi-
cations of ISO TC37 SC4 WG1’s Lin-
guistic Anotation Framework. Because 
few systems that enable search and ac-
cess of the corpus currently suport 
stand-off markup, the project has devel-
oped a SAX like parser that generates 
ANC data with annotations in-line, in a 
variety of output formats. 
1 Introduction 
The American National Corpus (ANC) project
1
 
recently released its 2
nd
 release consisting of ap-
proximately 2 milion words of data, represent-
ing a variety of genres of both writen and spo-
ken data. The corpus is annotated with several 
layers of automatically produced linguistic in-
formation, including sentence and token bounda-
ries, part of speech using two different POS tag-
sets (a version of the Penn tagset
2
 and the Biber 
tagset
3
), and noun chunks and verb chunks. 
ANC primary documents are plain text (UTF-
16) documents and are treated as “read only” 
resources. Al annotations are represented in 
stand-off XML documents referencing spans in 
the primary data or other annotation documents, 
using the XCES
4
 implementation of the specifi-
cations of ISO TC37 SC4’s Linguistic Anota-
tion Framework (LAF) (Ide and Romary, 204). 
Because few systems that enable search and ac-
cess of the corpus currently suport stand-off 
markup, the project has developed a parser that 
generates ANC data with annotations in-line, in a 
variety of output formats. 
                                                
1
htp:/ americanationalcorpus.org 
2
htp:/americanationalcorpus.org/FirstRelease/gatetags.txt 
3
htp:/americanationalcorpus.org/FirstRelease/Biber-
tags.txt 
4
htp:/ww.xces.org 
This demonstration wil show the “life-cycle” 
of an ANC document, from acquisition of a 
document in any of a variety of formats (MS 
Word, PDF, HTML, etc.) through annotation and 
final representation in the stand-off format. The 
ANC tol for merging annotations of the user’s 
choice with the primary data to produce a single 
document with in-line annotations wil also be 
demonstrated. 
2 ANC Document Life-Cycle 
Documents to be included in the ANC are ac-
quired in many different formats, including MS 
Word, PDF, HTML, Quark Express, etc. Proc-
essing involves a series of steps, which are out-
lined below. 
2.1 Conversion from original format to 
“rudimentary” XML 
The ANC receives documents in a variety of dif-
ferent formats. The first step in processing is to 
convert the input documents into XCES XML 
with basic structural annotations included. The 
most common types of file formats encountered 
are: 
• Microsoft Word. The release of OpenOf-
fice 2 has greatly simplified the processing 
of MS Word documents. OpenOffice uses 
XSL and XSLT stylesheets to export files 
to XML and ships with stylesheets to gen-
erate DocBook and TEI-compliant for-
mats. We modified the TEI stylesheet to 
create XCES XML. OpenOffice’s Java 
API enables us to automate and integrate 
OpenOffice with later processing steps. 
• XML/SGML/HTML. processing of 
XL files typically involves using XSLT 
to map element names to XCES. SGML 
and HTML files typically require pre-
processing to render them into valid XML, 
folowed by the application of an XSLT 
stylesheet to convert them to XCES. 
89
• Quark Express. Several publishers pro-
vided documents prepared for publication 
using Quark Express. Quark documents 
can be exported in XML, but doing so is 
worthwhile only if the creator of the 
document takes advantage of Quark’s 
style-definition facilities (which was not 
the case for any of the contributed Quark 
documents). We therefore exported the 
documents in RTF; however, many fonts 
and special characters are improperly ren-
dered, and fairly extensive manual editing 
was therefore required to render the files 
into a format that could be used. Once ed-
ited, the same procedures for MS Word 
documents are used to generate XCES. 
• PDF.  Bitmap PDF files are unusable for 
our purposes. Adobe Acrobat can generate 
plain text from PDF, although this process 
loses much of the formatting information 
that would be desirable to retain to facili-
tate later processing. In some cases, liga-
tures and other special characters are im-
properly represented in the text version, 
and it is not always posible to automati-
cally detect and convert them to conform 
to the original. PDF documents with two 
or more columns cannot, to our knowl-
edge, be extracted without some mis-
ordering of the text in the results. 
• Other formats. Other formats in which 
the ANC has acquired documents include 
plain text and plain text that employed a 
variety of proprietary markup languages. 
These documents are processed on a case 
by case basis, using specialized scripts. 
2.2 GATE processing and annotation 
We use the University of Sheffield’s GATE sys-
tem
5
 for the bulk of ANC document processing 
and annotation, currently including tokenization, 
sentence spliting, part of speech tagging, noun 
chunking, and verb chunking. Most annotations 
are produced using GATE’s built-in ANIE 
components; we have, however, modified the 
ANIE sentence spliter and created several Java 
plug-ins for use in GATE that perform basic 
bokeeping, renaming of annotations/features, 
moving of annotations between annotation sets 
etc. We have also developed a scripting language 
(XORO
6
) for use with GATE to enable easy bulk 
                                                
5
htp:/gate.ac.uk 
6
 htp:/ americanationalcorpus.org/xoro.html 
processing and re-processing of the entire cor-
pus, or to apply selected annotation steps without 
having to load the files into a GATE corpus or 
data store. This eases iterative development as 
documents are added and tols are refined. 
2.3 Creation of standoff annotation docu-
ments 
We have developed several custom processing 
resources that plug into GATE to generate stand-
off annotations in the XCES implementation of 
the LAF format. The last step in our GATE pipe-
line is to create the primary text document and 
generate all the required standoff annotation 
files. 
3 Standoff Format 
The ANC standoff format for annotations is a 
simple graph representation, consisting of one 
node set and one, or more, edge sets. The node 
set is represented by the text itself, with an im-
plied node between each character. Each edge set 
is represented by an XML document and may 
contain one or more annotation types: logical 
structure, sentence boundaries, tokens, etc. 
An ANC header file for each document is used 
to associate the source text with the standoff an-
notation documents; for example: 
<cesHeader> 
 .. 
 <anotations> 
 <anotation type="content" 
  an.loc="en_4065.txt"> 
  Text content</anotation> 
 <anotation type="logical" 
  an.loc="en_4065-logical.xml"> 
  Logical structure</anotation> 
   <anotation type="s" 
  an.loc="en_4065-s.xml"> 
  Sentence boundaries</anotation> 
   <anotation type="heple" 
  an.loc="en_4065-heple.xml"> 
  Heple POS tags</anotation> 
   <anotation type="biber" 
  an.loc="en_4065-biber.xml"> 
  Biber POS tags</anotation> 
   <anotation type="vp" 
  an.loc="en_4065-vp.xml"> 
  Verb chunks</anotation> 
   <anotation type="np" 
  an.loc="en_4065-np.xml"> 
  Noun chunks</anotation> 
 </anotations> 
   .. 
</cesHeader> 
ANC annotation documents are marked up 
with the XCES representation of the nodes and 
edge sets of the annotation graph. The folowing 
shows a segment of the document containing part 
of speech annotation: 
90
<cesAna 
xmlns="htp:/ww.xces.org/schema/203" 
version="1.0.4"> 
<struct type="tok" from="4" to="6"> 
  <feat name="base" value="in"/> 
  <feat name="msd" value="IN"/> 
</struct> 
<struct type="tok" from="7" to="1"> 
  <feat name="msd" value="DT"/> 
  <feat name="base" value="this"/> 
</struct> 
<struct type="tok" from="12" to="19"> 
  <feat name="base" value="chapter"/> 
  <feat name="msd" value="N"/> 
</struct> 
.. 
</cesAna> 
Each <struct> element represents an edge in the 
graph; values of the from and to attributes denote 
the nodes (between characters in the primary text 
document) over which the edge spans. 
3.1 Annotating discontiguous spans 
Presently, the ANC includes standoff annotations 
that reference contiguous spans of data in the 
original (primary) document. However, we plan 
to add a wide variety of automatically-produced 
annotations for various linguistic phenomena to 
the ANC data, some of which wil reference dis-
contiguous regions of the primary data, or may 
reference annotations contained in other standoff 
documents. This is handled as folows: given an 
anotation graph, G, we create an edge graph G’ 
whose nodes can themselves be annotated, 
thereby allowing for edges between the edges of 
the original annotation graph G. 
For example, consider the sentence “My dog 
has fleas.” The standoff annotations for the to-
kens would be: 
          1 
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 
|M|y| |d|o|g| |h|a|s| |f|l|e|a|s| 
 
<struct … id="t1" from="0" to="2"/> 
<struct … id="t2" from="3" to="6"/> 
<struct … id="t3" from="7" to="10"/> 
<struct … id="t4" from="1" to="16"/> 
Now consider the dependency tree generated 
by Minipar
7
 given in Figure 2. The tree can be 
represented by annotating the token elements in 
the standoff annotation document as folows: 
 
<!- Define some pseudo nodes -> 
<node type="rot" id"E0" ref="t3"/> 
<node type="clone" id="E2" ref="t2"/> 
 
 
<!- Define edges in dependency tre -> 
<struct type="subj" id="s1" 
from="t3" to="E2"/> 
<struct type="s" id="s2" 
from="t3" to="t2"/> 
                                                
7
htp:/ww.cs.ualberta.ca/~lindek/minipar.htm 
<struct type="gen" id="gen" 
from="t2" to="t1"/> 
<struct type="obj" id="obj" 
from="t3" to="t4"/> 
 
Figure 2. Dependency tre generated by Minipar.
8
 
4 Creating In-line Anotation Docu-
ments 
We have developed an “XCES Parser”
9
 that im-
plements the org.xml.sax.XMLReader interface 
to create ANC documents containing in-line an-
notations in XML (or any other format). 
The XCES parser works as folows: annota-
tions to be loaded are selected with the 
org.xml.sax.XMLReader.setProperty() 
method. The selected annotation sets are then 
loaded into a single list in memory and sorted, 
first by ofset and, if the offsets are the same, 
secondly by annotation type. At present, the or-
dering of annotation types are hard coded into 
the parser; work is underway to make the XCES 
parser "schema aware" so that embedding speci-
fications can be provided by the user. Once the 
text is loaded and sorted, the appropriate SAX2 
events are generated and dispatched to the 
org.xml.sax.ContentHandler (if one has been reg-
istered with the parser) in sequence to simulate 
the parsing of an XML document. While the 
parser wil allow the programmer to specify an 
ErrorHandler, DTDHandler, or EntityResolver, 
at this time no methods from those interfaces wil 
be invoked during parsing. 
In the current version of the XCES parser, 
when overlapping annotations are encountered, 
they are "truncated". For example: 
<s>Sentence <em>one.</s><s>Sentence</em> 
two.</s> 
                                                
8
 Image generated by 
htp:/ai.stanford.edu/~rion/parsing/minipar_viz.html 
9
 htp:/americanationalcorpus.org/tols/index.html#xces-
parser 
91
becomes 
<s>Sentence <em>one.</em></s><s>Sentence 
two.</s> 
Work is underway to provide the option to gen-
erate milestones in CLIX/HORSE (DeRose, 
204) format to represent overlapping hierar-
chies. 
4.1 Using the XCES parser 
The XCES parser can be used in three ways: 
• from the command line.  The xces-
parser.jar file can be run as a command 
line program to print XML with inline an-
notation to standard output. 
• as the XML parser used by other applica-
tions.  For example, Saxon
10
 can take the 
name of the parser to use to parse the 
source document as a command line pa-
rameter.  This allows us to apply XSLT 
stylesheets to ANC documents without 
having to translate them into XML first. 
• as a library for use in other Java applica-
tions. For example, The ANC Tol
1
 is a 
graphical front end to the XCES parser. 
4.2 The ANC tool 
The ANC Tol provides a graphical user inter-
face for the XCES parser and is used to convert 
ANC documents to other formats. Users specify 
their choice of annotations to be included. Cur-
rently, the ANC Tol can be used to generate the 
folowing output formats: 
• XML XCES format, suitable for use with 
the BNC’s XAIRA
12
 search and access in-
terface; 
• Text with part of speech tags appended to 
each word and separated by an under-
score; 
• WordSmith/MonoConc Pro format. 
The ANC Tol uses multiple implementations 
of the org.xml.sax.DocumentHandler interface, 
one for each output format, which the XCES 
parser uses to generate the desired output. Adi-
tional output formats can be easily generated by 
implementing additional interfaces. 
Of course, if the target application understands 
annotation graphs, there is no need to bother with 
the XCES parser or conversion to XML. For ex-
ample, we provide several resources for GATE 
                                                
10
 htp:/saxon.sourceforge.net/ 
1
 htp:/ americanationalcorpus.org/tols/anctol.html 
12
 htp:/sourceforge.net/projects/xaira 
that permit GATE to open and read ANC docu-
ments with standoff anotations, or to load 
standoff annotations into an already loaded 
document. 
5 Future Work 
Currently the XCES parser is a prof of concept 
rather than a production grade tol. The parser is 
being augmented to invoke all the apropriate 
methods from the org.xml.sax.*Handler 
interfaces and throw the proper SAXExceptions 
at the appropriate times.  We are also providing 
for some level of SAX conformance, rather than 
simply “doing what Xerces does”. 
6 Conclusion 
The ANC has implemented an efficient pipeline 
for the processing of text into a corpus of ma-
chine usable documents. For some document 
types this process is almost completely auto-
mated and can be regarded as a Corpus-Builder-
in-a Box: raw data goes in one end, and a fuly 
annotated corpus with standoff annotations 
comes out the other. 
 The use of standoff annotations allows for an 
accurate representation of the ANC data as pro-
vided by the contributors and allows us to easily 
provide several modular annotation sets that can 
be included or excluded by the end user as de-
sired. By providing a SAX like parser for ANC 
documents, we are able to leverage a number of 
available XML tols without the restrictions im-
posed by an XML representation of the docu-
ments. For users who are not interested in XML 
or standoff annotations, the plain text version is 
preserved. 

References 
DeRose, Steven J. (204). Markup Overlap: A Re-
view and a Horse. htp:/ww.mulberytech.com/ 
Extreme/Procedings/html/204/DeRose01/ 
EML204DeRose01.html 
Ide, N., Romary, L. (204). International standard for 
a linguistic anotation framework. Journal of 
Natural Language Enginering, 10:3-4, 21-25. 
