Querying XML documents with multi-dimensional markup
Peter Siniakov
siniakov@inf.fu-berlin.de
Database and Information Systems Group, Freie Universit¨at Berlin
Takustr. 9, 14195 Berlin, Germany
Abstract
XML documents annotated by differ-
ent NLP tools accommodate multi-
dimensional markup in a single hier-
archy. To query such documents one
has to account for different possible
nesting structures of the annotations
and the original markup of a docu-
ment. We propose an expressive pat-
tern language with extended seman-
tics of the sequence pattern, support-
ing negation, permutation and regu-
lar patterns that is especially appropri-
ate for querying XML annotated docu-
ments with multi-dimensional markup.
The concept of fuzzy matching allows
matching of sequences that contain tex-
tual fragments and known XML ele-
ments independently of how concurrent
annotations and original markup are
merged. We extend the usual notion of
sequence as a sequence of siblings al-
lowing matching of sequence elements
on the different levels of nesting and
abstract so from the hierarchy of the
XML document. Extended sequence
semantics in combination with other
language patterns allows more power-
ful and expressive queries than queries
based on regular patterns.
1 Introduction
XML is widely used by NLP tools for anno-
tating texts. Different NLP tools can produce
overlapping annotations of text fragments.
While a common way to cope with concur-
rent markup is using stand-off markup (Witt,
2004) with XPointer references to the anno-
tated regions in the source document, another
solution is to consolidate the annotations in a
single document for easier processing. In that
case concurrent markup has to be merged and
accommodated in a single hierarchy. There are
many ways to merge the overlapping markup
so that different nesting structures are pos-
sible. Besides, the annotations have to be
merged with the original markup of the doc-
ument (e.g. in case of a HTML document).
The problem of merging overlapping markup
has been treated in (Siefkes, 2004) and we do
not consider it here. Instead we focus on the
problem of finding a universal querying mech-
anism for documents with multi-dimensional
markup. The query language should abstract
from the concrete merging algorithm for con-
current markup, that is to identify desired
elements and sequences of elements indepen-
dently from the concrete nesting structure.
The development of the query language was
motivated by an application in text mining.
In some text mining systems the linguistic
patterns that comprise text and XML anno-
tations (such as syntactic annotations, POS
tags) made by linguistic tools are matched
with semistructured texts to find desired infor-
mation. These texts can be HTML documents
that are enriched with linguistic information
by NLP tools and therefore contain multi-
dimensional markup. The linguistic annota-
tions are specified by XML elements that con-
tain the annotated text fragment as CDATA.
Due to the deliberate structure of the HTML
document the annotations can be nested in ar-
bitrary depth and vice versa – the linguistic
XML element can contain some HTML ele-
ments with nested text it refers to. To find
a linguistic pattern we have to abstract from
the concrete DTD and actual structure of the
XML document ignoring irrelevant markup,
which leads to some kind of “fuzzy” match-
ing. Hence it is sufficient to specify a sequence
43
of text fragments and known XML elements
(e.g. linguistic tags) without knowing by what
elements they are nested. During the match-
ing process the nesting markup will be omitted
even if the sequence elements are on different
nesting levels.
We propose an expressive pattern language
with the extended semantics of the sequence
pattern, permutation, negation and regular
patterns that is especially appropriate for
querying XML annotated documents. The
language provides a rich tool set for specify-
ing complex sequences of XML elements and
textual fragments. We ignore some important
aspects of a fully-fledged XML query language
such as construction of result sets, aggregate
functions or support of all XML Schema struc-
tures focusing instead on the semantics of the
language.
Some modern XML query languages impose
a relational view of data contained in the XML
document aiming at retrieval of sets of ele-
ments with certain properties. While these ap-
proaches are adequate for database-like XML
documents, they are less appropriate for doc-
uments in that XML is used rather for anno-
tation than for representation of data. Tak-
ing the rather textual view of a XML doc-
ument its querying can be regarded as find-
ing patterns that comprise XML elements and
textual content. One of the main differences
when querying annotated texts is that the
query typically captures parts of the docu-
ment that go beyond the boundaries of a sin-
gle element disrupting the XML tree structure
while querying a database-like document re-
turns its subtrees remaining within a scope of
an element. Castagna (Castagna, 2005) dis-
tinguishes path expressions that rather corre-
spond to the database view and regular ex-
pression patterns as complementary “extrac-
tion primitives” for XML data. Our approach
enhances the concept of regular expression
patterns making them mutually recursive and
matching across the element boundaries.
2 Related Work
After publishing the XML 1.0 recommenda-
tion the early proposals for XML query lan-
guages focused primarily on the representa-
tion of hierarchical dependencies between el-
ements and the expression of properties of a
single element. Typically, hierarchical rela-
tions are defined along parent/child and an-
cestor/descendant axis as done in XQL and
XPath. XQL (Robie, 1998) supports posi-
tional relations between the elements in a sib-
ling list. Sequences of elements can be queried
by “immediately precedes” and “precedes” op-
erators restricted on the siblings. Negation,
conjunction and disjunction are defined as fil-
tering functions specifying an element. XPath
1.0 (Clark and DeRose, 1999) is closely re-
lated addressing primarily the structural prop-
erties of an XML document by path expres-
sions. Similarly to XQL sequences are de-
fined on sibling lists. Working Draft for Xpath
2.0 (Berglund et al., September 2005) provides
support for more data types than its precur-
sor, especially for sequence types defining set
operations on them.
XML QL (Deutsch et al., 1999) follows the
relational paradigm for XML queries, intro-
duces variable binding to multiple nodes and
regular expressions describing element paths.
The queries are resolved using an XML graph
as the data model, which allows both ordered
and unordered node representation. XQuery
(Boag et al., 2003) shares with XML QL the
concept of variable bindings and the ability
to define recursive functions. XQuery fea-
tures more powerful iteration over elements
by FLWR expression borrowed from Quilt
(Chamberlin et al., 2001), string operations,
“if else” case differentiation and aggregate
functions. The demand for stronger support
of querying annotated texts led to the integra-
tion of the full-text search in the language (Re-
quirements, 2003) enabling full-text queries
across the element boundaries.
Hosoya and Pierce propose integration of
XML queries in a programming language
(Hosoya and Pierce, 2001) based on regular
patterns Kleene’s closure and union with the
“first-match” semantics. Pattern variables can
be declared and bound to the correspond-
ing XML nodes during the matching process.
A static type inference system for pattern vari-
ables is incorporated in XDuce (Hosoya and
Pierce, 2003) – a functional language for XML
processing. CDuce (Benzaken et al., 2003)
extends XDuce by an efficient matching al-
44
gorithm for regular patterns and first class
functions. A query language CQL based on
regular patterns of CDuce uses CDuce as a
query processor and allows efficient processing
of XQuery expressions (Benzaken et al., 2005).
The concept of fuzzy matching has been inro-
duced in query languages for IR (Carmel et
al., 2003) relaxing the notion of context of an
XML fragment.
3 Querying by pattern matching
The general purpose of querying XML doc-
uments is to identify and process their frag-
ments that satisfy certain criteria. We re-
duce the problem of querying XML to pat-
tern matching. The patterns specify the query
statement describing the desired properties of
XML fragments while the matching fragments
constitute the result of the query. Therefore
the pattern language serves as the query lan-
guage and its expressiveness is crucial for the
capabilities of the queries. The scope for the
query execution can be a collection of XML
documents, a single document or analogously
to XPath a subtree within a document with
the current context node as its root. Since in
the scope of the query there may be several
XML fragments matching the pattern, multi-
ple matches are treated according to the “all-
match” policy, i.e. all matching fragments are
included in the result set. The pattern lan-
guage does not currently support construc-
tion of new XML elements (however, it can be
extended adding corresponding syntactic con-
structs). The result of the query is therefore a
set of sequences of XML nodes from the doc-
ument. Single sequences represent the XML
fragments that match the query pattern. If no
XML fragments in the query scope match the
pattern, an empty result set is returned.
In the following sections the semantics, main
components and features of the pattern lan-
guage are introduced and illustrated by exam-
ples. The complete EBNF specification of the
language can be found on
http://page.mi.fu-berlin.de/~siniakov/patlan.
3.1 Extended sequence semantics
Query languages based on path expressions
usually return sets (or sequences) of elements
that are conform with the original hierarchical
structure of the document. In not uniformly
structured XML documents, though, the hi-
erarchical structure of the queried documents
is unknown. The elements we may want to
retrieve or their sequences can be arbitrarily
nested. When retrieving the specified elements
the nesting elements can be omitted disrupt-
ing the original hierarchical structure. Thus
a sequence of elements does no longer have to
be restricted to the sibling level and may be
extended to a sequence of elements following
each other on different levels of XML tree.
Figure 1: Selecting the sequence (NE ADV
V) from a chunk-parsed POS-tagged sentence.
XML nodes are labeled with preorder num-
bered OID|right bound (maximum descen-
dant OID)
To illustrate the semantics and features of
the language we will use the mentioned text
mining scenario. In this particular text mining
task some information in HTML documents
with textual data should be found. The doc-
uments contain linguistic annotations inserted
by POS tagger and syntactic chunk parser as
XML elements that include the annotated text
fragment as a text node. The XML output
of the NLP tools is merged with the HTML
markup so that various nestings are possible.
A common technique to identify the relevant
information is to match linguistic patterns de-
scribing it with the documents. The fragments
of the documents that match are likely to con-
tain relevant information. Hence the problem
is to identify the fragments that match our lin-
guistic patterns, that is, to answer the query
where the queried fragments are described by
linguistic patterns. Linguistic patterns com-
prise sequences of text fragments and XML el-
ements added by NLP tools and are specified
in our pattern language. When looking for lin-
guistic patterns in an annotated HTML docu-
45
ment, it cannot be predicted how the linguistic
elements are nested because nesting depends
on syntactic structure of a sentence, HTML
layout and the way both markups are merged.
Basically, the problem of unpredictable nest-
ing occurs in any document with a hetero-
geneous structure. Let us assume we would
search for a sequence of POS tags: NE ADV V
in a subtree of a HTML document depicted
in fig. 1. Some POS tags are chunked in
noun (NP), verb (VP) or prepositional phrases
(PP). Named entity “Nanosoft” is emphasized
in boldface and therefore nested by the HTML
element <b>. Due to the syntactic structure
and the HTML markup the elements NE, ADV
and V are on different nesting levels and not
children of the same element. According to
the extended sequence semantics we can ig-
nore the nesting elements we are not inter-
ested in (NPOID2 and bOID3 when matching
NE, VPOID8 when matching V) so that the se-
quence (NEOID4, ADVOID6, VOID9) matches
the sequence pattern NE ADV V, in short form
NE ADV V ∼= (NE4, ADV6, V9).
By the previous example we introduced
the matching relation ∼= as a binary relation
∼= ⊆ P × F where P is the set of patterns
and F a set of XML fragments. An XML frag-
ment f is a sequence of XML nodes n1 ...nn
that belong to the subtree of the context node
(i.e. the node whose subtree is queried, e.g.
document root). Each XML node in the sub-
tree is labeled by the pair OID|right bound.
OID is obtained assigning natural numbers
to the nodes during the preorder traversal.
Right bound is the maximum OID of a de-
scendant of the node – the OID of the right-
most leaf in the rightmost subtree. To match
a sequence pattern an XML fragment has to
fulfil four important requirements.
1. Consecutiveness: All elements of the se-
quence pattern have to match the consec-
utive parts of the XML fragment
2. Order maintenance: Its elements must be
in the “tree order”, i.e., the OIDs of the
nodes according to the preorder number-
ing schema must be in ascending order.
3. Absence of overlaps: No node in the se-
quence can be the predecessor of any
other node in the sequence on the way to
the root. E.g. NP PP NP negationslash∼= (NP11, PP18,
NP21) because PP18 is a predecessor of
NP21 and therefore subsumes it in its sub-
tree. The semantics of the sequence im-
plies that a sequence element cannot be
subsumed by the previous one but has
to follow it in another subtree. To de-
termine whether a node m is a predeces-
sor of the node n the OIDs of the nodes
are compared. The predecessor must have
a smaller OID according to the preorder
numbering scheme, however any node in
left subtrees of n has a smaller OID too.
Therefore the right bounds of the nodes
can be compared since the right bound of
a predecessor will be greater or equal to
the rightbound ofnwhile the rightbound
of any element in the left subtree will be
smaller:
pred(m,n) =
OID(m)<OID(n) ∧ rightBound(m)≥rightBound(n)
4. Completeness: XML fragment must not
contain any gaps, i.e. there should not be
a node that is not in the XML fragment,
not predecessor of one of the nodes, whose
OID however lies between the OIDs of
the fragment nodes. Since such a node
is not a predecessor, it must be an el-
ement of the sequence; otherwise it is
omitted and the sequence is not complete.
Hence, the pattern V NP NP negationslash∼= (V9, NP11,
NP21) because the node PR19 lying be-
tween NP11 and NP21 is not a predeces-
sor of any of the fragment nodes and not
an element of the fragment. If the nodes
lying between NP11 and NP21 cannot be
exactly specified, we can use wildcard pat-
tern (see sec. 3.3) to enable matching:
V NP * NP ∼= (V9, NP11, PR19, NP21):
Using these requirements we can formally
specify the semantics of the sequence:
Let s = s1 ...sk be a sequence pattern and
f = n1 ...nn the matching XML fragment.
s ∼= f ⇔
(I) s1∼=(n1...ni), s2∼=(ni+1...nj),...,sk∼=(nl...nn)
(II) ∀ 1≤i<n OID(ni)<OID(ni+1)
(III) negationslash∃ 1≤i<n pred(ni,ni+1)
(IV ) ∀ 1≤i<n negationslash∃ m OID(ni)<OID(m)<OID(ni+1)∧
∧¬pred(m,ni+1)
46
The fourth requirement stresses the impor-
tant aspect of “exhaustive” sequence: we are
interested in a certain sequence of known ele-
ments that can be arbitrarily nested and cap-
tured by some elements that are irrelevant
for our sequence (e.g. html layout elements
when searching for a sequence of linguistic el-
ements). We call such a sequence an exhaus-
tive non-sibling sequence (ENSS). It is exhaus-
tive because all predecessors omitted during
the matching are covered at some level by the
matching descendants so that there is no path
to a leaf of the predecessor subtree that leads
through an unmatched node. If such a path
existed, the fourth requirement would not be
met. If the sequence does not begin at the left-
most branch or does not end at the rightmost
branch of an omitted predecessor, the subtree
of the respective predecessor is not fully cov-
ered. In ADJ NN PR ∼= (ADJ14, NN16, PR19)
the omitted predecessors NP11 and PP18 are
not completely a part of the sequence because
they have descendants outside the sequence
borders. Nevertheless the sequence is exhaus-
tive since there is no path to a leaf through an
unmatched node within its borders.
Another important aspect of ENSS is that it
can match XML fragments across the element
borders. XPath imposes a query context by
specifying the path expression that usually ad-
dresses a certain element, XQuery restricts it
indirect by iterating over and binding variables
to certain nodes. Matching ENSS there is no
additional restriction of the query scope, that
is, the sequence can begin and end at any node
provided that the ENSS requirements are met.
The dashed line in the fig. 1 points up the re-
gion covered by the sample sequence.
According to the specification of the se-
quence pattern in the pattern language (cf.
appendix ??):
Pattern ::= Patternprime prime∗ Pattern
any pattern can be the element of the se-
quence. Therefore the sequence can also
contain textual elements, which is especially
important when processing annotated texts.
Textual nodes represent leaves in an XML tree
and are treated as other XML nodes so that
arbitrary combinations of XML elements and
text are possible: "released" NP "of" NE ∼=
(”released”10, NP11, ”of”20, NE22)
Exhaustive sequence allows a much greater
abstraction from the DTD of a document than
the usually used sequence of siblings. The ex-
pressiveness of the language significantly bene-
fits from the combination of backtracking pat-
terns (cf. sec. 3.3) with exhaustive sequence.
3.2 Specification of XML nodes
Patterns matching single XML nodes are the
primitives that the more complex patterns are
composed from. The pattern language sup-
ports matching for document, element, at-
tribute, text and CDATA nodes while some
DOM node types such as entities and process-
ing instructions are not supported. Some ba-
sic patterns matching element and text nodes
have been already used as sequence elements
in the previous section. Besides the simple ad-
dressing of an element by its name it is pos-
sible to specify the structure of its subtree:
Pattern ::=prime \primeXML-Tag(prime[primePatternprime]prime)?
A pattern specifying an element node will
match if the element has the name correspond-
ing to the XML-Tag and the pattern in the
square brackets matches the XML fragment
containing the sequence of its children. E.g.
\PP[ PR NE] ∼= (PP18) because the name of
the element is identical and PR NE ∼= (PR19,
NE22). As this example shows, the extended
sequence semantics applies also when the se-
quence is used as the inner pattern of another
pattern. Therefore the specification of ele-
ments can benefit from the ENSS because we
again do not have to know the exact structure
of their subtrees, e.g. their children, but can
specify the nodes we expect to occur in a cer-
tain order.
Attribute nodes can be accessed by ele-
ment pattern specifying the attribute values
as a constraint: \V {@normal="release"} ∼=
(V9), assumed that the element V9 has the
attribute ”normal” that stores the principal
form of its textual content. Besides equality
tests, numeric comparisons and boolean func-
tions on string attribute values can be used as
constraints.
Patterns specifying textual nodes comprise
quoted strings:
Pattern ::= QuotedString
and match a textual node of an XML element
if it has the same textual content as the quoted
string. Textual patterns can be used as ele-
47
ments of any other patterns as already demon-
strated in the previous section. An element
may be, for instance, described by a complex
sequence of text nodes combined with other
patterns: \sentence[NE * \V{@normal=release}
\NP[* "new" "version"] "of" NE *] ∼= (sentence1)
The pattern above can already be used as a
linguistic pattern identifying the release of a
new product version.
3.3 Backtracking patterns and
variables
In contrast to the database-like XML docu-
ments featuring very rigid and repetitive struc-
tures annotated texts are distinguished by a
very big structural variety. To handle this va-
riety one needs patterns that can cover several
different cases “at once”. So called backtrack-
ing patterns have this property and constitute
therefore a substantial part of the pattern lan-
guage. Their name comes from the fact that
during the matching process backtracking is
necessary to find a match.
The pattern language features complex and
primitive patterns. Complex patterns consist
of at least one inner element that is a pattern
itself. Primitive patterns are textual patterns
or XML attribute and element specifications
if the specification of the inner structure of
the element is omitted, e.g. "released", NP.
If at least one of the inner patterns does not
match, the matching of the complex pattern
fails. Backtracking patterns except for wild-
card pattern are complex patterns.
Let us assume, we look for a sequence
"released" NE and do not care what is be-
tween the two sequence elements. In the sub-
tree depicted in fig. 1 no XML fragment
will match because there are several nodes be-
tween ”released”10 and NE22 and the com-
pleteness requirement is not met. If we in-
clude the wildcard pattern in the sequence,
"released" * NE ∼= (”released”10 NP11 PR19
NE22), the wildcard pattern matches the
nodes lying between V9 and NE22. Thus, ev-
ery time we do not know what nodes can oc-
cur in a sequence or we are not interested in
the nodes in some parts of the sequence, we
can use wildcard pattern to specify the se-
quence without losing its completeness. Wild-
card pattern matches parts of the sequence
that are in turn sequences themselves. There-
fore it matches only those XML fragments that
fulfil the ENSS requirements II-IV. Since there
are often multiple possibilities to match a se-
quence on different levels, wildcard matches
nodes that are at the highest possible level
such as NP11 in the previous example.
If one does not know whether an XML frag-
ment occurs, but wants to account for both
cases the option pattern should be used:
Pattern ::=prime (primePatternprime)?prime
Pattern ::=prime (primePatternprime)∗prime
Kleene closure differs from the option by the
infinite number of repetitions. It matches a se-
quence of any number of times repeated XML
fragments that match the inner pattern of the
Kleene closure pattern. Since Kleene closure
matches sequences, the ENSS requirements
have to be met by matching XML fragments.
Let O = (p)? be an option, K = (p)∗ a Kleene
closure pattern, f ∈ F an XML fragment:
O ∼= f ⇔ p ∼= f ∨ {} ∼= f
K ∼= f ⇔ {} ∼= f ∨ p ∼= f ∨ p p ∼= f ∨ ...
where f fulfils ENSS requirements I-IV.
The option pattern matches either an empty
XML fragment or its inner pattern.
An alternative occurrence of two XML
fragments is covered by the union pattern:
Pattern ::=prime (primePattern(prime|primePattern)+prime)prime
Different order of nodes in the sequence can
be captured in the permutation pattern:
Pattern ::=prime (primePattern Pattern+prime)%prime
Let U = (p1|p2) be a union pattern,
P = (p1,...,pn)% a permutation pattern
U ∼= f ⇔ p1 ∼= f ∨ p2 ∼= f
P ∼= f ⇔ p1 p2...pn ∼=f ∨ p1 p2...pn pn−1 ∼=f ∨···
···∨ p1 pn...p2∼=f ∨ ···∨ pn pn−1...p2 p1∼=f
Permutation can not be expressed by regular
constructs and is therefore not a regular ex-
pression itself.
The backtracking patterns can be arbitrar-
ily combined to match complex XML frag-
ments. E.g. the pattern ((PP | PR)? NP)%
matches three XML fragments: (NP2), (NP11,
PP18) and (PR19, NP21). Using the backtrack-
ing patterns recursively enlarges the expres-
sivity of the patterns a lot allowing to specify
very complex and variable structures without
significant syntactic effort.
48
Variables can be assigned to any pattern
Pattern ::= Patternprime =:prime String
accomplishing two functions. Whenever a
variable is referenced within a pattern by the
reference pattern Pattern ::=prime $primeStringprime$prime
it evaluates to the pattern it
was assigned to. The pattern
(NP)∗=:noun_phrase * $noun_phrase$
∼= (NP2, ADV6, VP8, NP11) so that the
referenced pattern matches NP11. A pattern
referencing the variable v matches XML
fragments that match the pattern that has
been assigned to v. To make the matching
results more persistent and enable further
processing variables can be bound to the
XML fragment that matched the pattern the
variable is assigned to. After matching the
pattern \sentence[NE=:company *
\V{@normal=release} \NP[* "new" "version"]
"of" NE=:product *] ∼= (sentence1)
the variable company refers to NE4(Nanosoft)
and product is bound to NE22(NanoOS).
The relevant parts of XML fragment can
be accessed by variables after a match
has been found. Assigning variable to the
wildcard pattern can be used to extract
a subsequence between two known nodes:
"released" * =:direct_object "of" ∼=
(”released”10 NP11 ”of”20) with the variable
direct_object bound to NP11.
Let A = p =: v be an assignment pattern:
A ∼= f ⇔ p ∼= f
Matching backtracking patterns can involve
multiple matching variants of the same XML
fragment, which usually leads to different
variable bindings for each matching variant.
As opposed to multiple matchings when
different fragments match the same pattern
discussed above, the first-match policy is ap-
plied when the pattern ambiguously matches
a XML fragment. For instance,two different
matching variants are possible for the pattern
(NP)?:=noun_phrase (NP |PR)∗:=noun_prep
∼= (NP11, PR19). In the first case
(NP)?:=noun_phrase ∼= (NP11) so that
noun_phrase is bound to NP11 and
noun_prep to PR19. In the second
case (NP)?:=noun_phrase ∼= {} and
(NP | PR)∗:=noun_prep ∼= (NP11, PR19)
so that noun_phrase is bound to {} and
noun_prep to (NP11, PR19). In such cases
the first found match is returned as the final
result. The order of processing of single
patterns is determined by a convention.
3.4 Negation
When querying an XML document it is often
useful not only to specify what is expected but
also to specify what should not occur. This
is an efficient way to exclude some unwanted
XML fragments from the query result because
sometimes it is easier to characterize an XML
fragment by not wanted rather than desirable
properties. Regular languages (according to
Chomsky’s classification) are not capable of
representing that something should not appear
stating only what may or has to appear. In the
pattern language the absence of some XML
fragment can be specified by negation .
As opposed to most XML query languages
negation is a pattern and not a unary boolean
operator. Therefore it has no boolean value,
but matches the empty XML fragment.
Since the negation pattern specifies what
should not occur, it does not “consume” any
XML nodes during the matching process so
that we call it “non-substantial” negation.
The negation pattern !(p) matches the
empty XML fragment if its inner pattern
p does not occur in the current context
node. To underline the difference to logical
negation, consider the double negation. The
double negation !(!(p)) is not equivalent
to p, but matches an empty XML element
if !(p) matches the current context node,
which is only true if the current context
node is empty. Since the negation pattern
only specifies what should not occur, the
standalone usage of negation is not reason-
able. It should be used as an inner pattern of
other complex patterns. Specifying a sequence
VP *=:wildcard_1 !(PR) *=:wildcard_2 NP
we want to identify sequences starting with
VP and ending with NP where PR is not
within a sequence. Trying to find a match
for the sequence starting in VP8 and ending
in NP21 there are multiple matching variants
for wildcard patterns. Some of them enable
the matching of the negation pattern binding
PR to one of the wildcards, e.g. wildcard_1
is bound to (NP11, PR19), !(PR) ∼= {},
wildcard_2 is bound to {}. However, there
49
is a matching variant when the negated
pattern is matched with PR19 (wildcard_1
is bound to NP11, wildcard_2 is bound
to {}). We would certainly not want the
sequence (VP8, NP11, PR19, NP21) to match
our pattern because the occurrence of PR in
the sequence should be avoided. Therefore we
define the semantics of the negation so that
there is no matching variant that enables the
occurrence of negated pattern:
Let P1 !(p) P2 be a complex pattern compris-
ing negation as inner pattern. P1 and P2 are
the left and right syntactic parts of the pat-
tern and may be not valid patterns themselves
(e.g. because of unmatched parentheses). The
pattern obtained from the concatenation of
both parts P1 P2 is a valid pattern because it
is equivalent to the replacing of the negation
by an empty pattern.
P1 !(p) P2 ∼= f ⇔
P1 p P2 negationslash∼= f ∧P1 P2 ∼= f
Requiring P1 p P2 negationslash∼= f guarantees that no
matching variant exists in that the negated
pattern p occurs. Since !(p) matches an empty
fragment, the pattern P1P2 has to match com-
plete f. It is noteworthy that the negation is
the only pattern that influences the semantics
of a complex pattern as its inner pattern. In-
dependent of its complexity any pattern can
be negated allowing very fine-grained specifi-
cation of undesirable XML fragments.
4 Conclusion
XML documents with multi-dimensional
markup feature a heterogeneous structure
that depends on the algorithm for merging
of concurrent markup. We present a pattern
language that allows to abstract from the
concrete structure of a document and formu-
late powerful queries. The extended sequence
semantics allows matching of sequences across
element borders and on different levels of the
XML tree ignoring nesting levels irrelevant
for the query. The formal specification of
the sequence semantics guarantees that the
properties of “classic” sibling sequence such
as ordering, absence of gaps and overlaps
between the neighbors are maintained. The
combination of fully recursive backtracking
patterns with the ENSS semantics allows
complex queries reflecting the complicated
positional and hierarchical dependencies
of XML nodes within a multi-dimensional
markup. Negation enhances the expressivity
of the queries specifying an absence of a
pattern in a certain context.

References
V. Benzaken, G. Castagna, and A. Frisch. 2003.
CDuce: an XML-centric general-purpose language.
In In ICFP ’03, 8th ACM International Conference
on Functional Programming, pages 51–63.
V. Benzaken, G. Castagna, and C. Miachon. 2005.
A full pattern-based paradigm for XML query pro-
cessing. In Proceedings of the 7th Int. Symposium on
Practical Aspects of Decl. Languages, number 3350.
A. Berglund, S. Boag, D. Chamberlin, M. Fernndez,
M. Kay, J. Robie, and J. Simon. September 2005.
XML Path Language (XPath) 2.0. http://www.w3.
org/TR/2005/WD-xpath20-20050915/.
S. Boag, D. Chamberlin, M. Fernandez, D. Florescu,
J. Robie, J. Simon, and M. Stefanescu. 2003.
XQuery 1.0: An XML Query Language. http:
//www.w3c.org/TR/xquery.
David Carmel, Yoelle S. Maarek, Matan Mandelbrod,
Yosi Mass, and Aya Soffer. 2003. Searching XML
documents via XML fragments. In SIGIR ’03: Pro-
ceedings of the 26th annual int. ACM SIGIR confer-
ence, pages 151–158, New York,USA. ACM Press.
G. Castagna. 2005. Patterns and types for querying
XML. In DBPL - XSYM 2005 joint keynote talk.
Don Chamberlin, Jonathan Robie, and Daniela Flo-
rescu. 2001. Quilt: An XML query language for
heterogeneous data sources. LNCS, 1997:1–11.
J. Clark and S. DeRose. 1999. XML Path Language
(XPath). http://www.w3.org/TR/Xpath.
Alin Deutsch, Mary F. Fernandez, D. Florescu, A.Y.
Levy, and D. Suciu. 1999. A query language for
XML. Computer Networks, 31(11-16):1155–1169.
H. Hosoya and P.C. Pierce. 2001. Regular expression
patern matching for XML. In In POPL ’01, 25th
Symposium on Principles of Prog. Languages.
Haruo Hosoya and Benjamin C. Pierce. 2003. XDuce:
A statically typed XML processing language. ACM
Trans. Inter. Tech., 3(2):117–148.
XQuery and XPath Full-Text Require-
ments. 2003. http://www.w3.org/TR/2003/
WD-xquery-full-text-requirements-20030502/.
Jonathan Robie. 1998. The design of XQL. http:
//www.ibiblio.org/xql/xql-design.html.
C. Siefkes. 2004. A shallow algorithm for correcting
nesting errors and other well-formedness violations
in XML-like input. In Extreme Markup Languages.
A. Witt. 2004. Multiple hierarchies: new aspects of an
old solution. In Extreme Markup Languages 2004.
