Layout and Language: Integrating Spatial and Linguistic 
Knowledge for Layout Understanding Tasks 
Matthew Hurst and Tetsuya Nasukawa 
IBM Research, Tokyo Research Laboratory 
Abstract 
Complex documents stored in a flat or partially 
marked up file format require layout sensitive pre- 
processing before any natural  processing 
can be carried out on their textual content. Con- 
temporary technology for the discovery of basic tex- 
tual units is based on either spatial or other con- 
tent insensitive methods. However, there are many 
cases where knowledge of both the  and lay- 
out is required in order to establish the boundaries 
of the basic textual blocks. This paper describes 
a number of these cases and proposes the applica- 
tion of a general method combining knowledge about 
 with knowledge about the spatial arrange- 
ment of text. We claim that the comprehensive un- 
derstanding of layout can only be achieved through 
the exploitation of layout knowledge and  
knowledge in an inter-dependent maimer. 
1 Introduction 
There is currently a significant amount of work be- 
ing carried out on applications which aim to de- 
duce layout information fl'om a spatial descrit/tion 
of a document. The tasks vary in detail, however 
they generally take as input a document description 
which presents areas of text (including titles, head- 
lags, paragraphs, lists and tables) marked implicitly 
by position. A simple example is a flat text docu- 
ment which uses white space to demonstrate align- 
meat at the edges of textual blocks and blank lines 
to indicate vertical spatial cohesion and separatiou 
between blocks. 1 
Rus and Summers ((Rus and Su,nmers, 1994)) 
state that "the non-teztual content of documents 
\[complement\] the tcztual content and should play art 
equal role". This is clearly desirable: textual and 
spatial properties, as described in tiffs paper, are 
inter-related and it is in fact highly beneficial to ex- 
ploit the relationships which exist between them. In 
1The term spatiM cohesion is motivated by the work on 
lexical cohesion by Morris and Hirst ((Morris and Hirst, 
1991)). Text which is cohesive is text which has a quality 
of unity (p. 21). Objects which have spatial cohesion have a 
quality of unity indicated by spatial features; in the words of 
Morris and Hirst: they "stick together". 
algorithmic terms, this implies implementing solu- 
tions which use both spatial and linguistic features 
to detect coherent textual objects its the raw text. 
Apt)roaches to tile problem are limited to those ex- 
ploiting spatial cohesion. There are two techuiques 
for achieving this. The first looks for features of 
space, identifying rivers of space which ruts around 
text blocks in some memfingflfl maimer. Tim second 
looks at non-linguistic qualities of the text includ- 
ing alignment of tokens between lines as well as cer- 
tain types of global interactions (e.g. (Kieninger and 
Dengel, 1998)). Although this second type focuses 
on the characters rather than the spaces in the text, 
tim features that it detects are implications of tile 
spatial arrangement of tile text: judging two words 
to be overlapping in the horizontal axis is not a fea- 
ture of tile words in terms of their content~ but of 
their position. Elements of the above basic methods 
may be combined and, as with any f'eatnre vector 
type of mmlysis, machine learning algorithins may 
be applied (e.g. (Ng el; al., 1999)). 
2 A New Method 
Tile methods based on spatial cohesion outlined 
above make assumptions about the application of 
layout to the textual content of the document in or- 
der to derive features indicating higher order struc- 
ture. These assumptions rely on tile realisation of 
layout as space and do not always hold (see, e.g., 
Figure 4: Grid Quantization), and may result in atn- 
biguities. However, there is another source of infor- 
mation which can be exploited to recover layout. 
Though layout imt)oses spatial etfects, it has lit- 
tle or no effect on low level  phenomena 
witlfin the distinct layout document objects: we do 
not ca'poet the layout of tea;t to render it ungr'ammat- 
ical? Conversely, we do not expect grmmnaticality 
to persist in an incorrect interpretation of layout. 
For example, applying this observation to the seg- 
mentation of a double colmnn of text will indicate 
2It is clear that layout does has very definite consequences 
for tile content of textual document elements, however those 
features we are concerned with here are below even this rudi- 
mentary level of  analysis. 
334 
the line breaks, see Figure 4: Double Cohunns. :3 The 
aI)l)lication of a low level hmguage model to the in- 
terpretation of spatially distinct textual areas can be 
applied in many cases where a tmrely spatial algo- 
rithm may fail. The following is an incomplete list 
of possible cases of application (concrete examples 
may be found in Figure 4): 
Multi Column Text When the cohmms are sepa- 
rated by only one space, a  model may 
be aI)l)lied to determine if and where the blocks 
exist. These m W be confused with False Space 
Positives where, by chance, the text tbnnat- 
ting introduces streams of white space within 
contiguous text. 
Apposed/Marginal Material Text which is off'- 
set from the main body of text, similarly to 
multi column text, will contain its own line 
breaks. 
Unmarked Headers Headers may be unmarked 
and appear silnilm' to single line 1)aragraphs. 
Double Spacing The introduction of more tlmn 
one line of spacing within contiguous text causes 
ambiguities with paragraph set)aration, headers 
alld so oi1. 
Elliptical Lists When text continues through a 
layout device, a  model may 1)e used 
to detect it. a 
Short Paragraphs When a t)aragral)hs is 1)artic- 
ularly short, the insertion of a line break may 
(:ause prot)lems. 
Another exmnple, and a usefifl at)plication, is that 
to the 1)rot)lore of table segmentation. Once a table 
has been located using this method or other meth- 
ods, the cells must be located. 
Multi-Cohunn Cells A cell sl)ans multit)le 
cohmms. This may easily 1)e conflised with 
Multi-Row Cells where a cell contains more 
than one line and must be groul)ed according 
to the line breaks. 
Elliptical Cell Contents Cells which tbrm a dis- 
junctiou of possible contilmations to the content 
of another cell can be identified using a  
model. 
Grid Quantization When a plain text table con- 
tains ceils which arc not wholly aligned with 
aIn Figure 4: Doubh,. Cohmms, we know, through 
the al)plicatlon of a  model, that there in a line 
t)reak after paragraph as a paragraph of text in more likely 
than a paragraph Applying, and Applying this of text 
is grammatically. 
4'l'his bares similm'ities with a simple list, lint the . 
is that of the textual lint; which uses flmctional words and 
lmnctuation to indicate disjmmtion. 
other cells in the stone grid row or column, it is 
difficult to associate the cells correctly. 
Languages which permit vertical and horizon- 
tal orthography (such as Japanese) pose additional 
t)roblems when extracting layout features from l)lain 
text data. 
Orientation Detection With mixed orientation, 
a  model may be used to distinguish 
vertical and horizontal text blocks) 
We can hyl)othesise that spatially cohesive areas 
of the document are renderings of some underlying 
textual representation. If, at some level, the text 
is set)arated from the layout (the text is linearised 
by removing line breaks), then we may observe cer- 
tain linguistic phenomena which are characteristic 
of the bmguage. Reversing this allows us to identify 
the sl)atially cohesive objects in the document t)y 
discovering the transfonnatioll to the text (the ap- 
plication of layout, i.e. the insertion of spacing and 
line breaks) which preserve our observations about 
the . One such observation is the ordering 
of words. Consequently, we can apply a  
model to a line of text in a docuinent to determine 
where line breaks have been inserted into the text 
for layout purt)oses by observing where the  
model breaks down and where our simt)le notion of 
layout 1)ased on sl)atial features i)ermits text block 
segmelttation. This is an ideal. In fact, knowledge 
of layout and lan.q'uagc is required to over'co'me th, e 
short comings of each,. 
There are many tyt)es of  model which 
may be applied to the problem being considered, 
ranging from the analytical - which provide an in- 
dication of linguisti(" structure), to tile classi\[ying - 
which indicate if (and to what extent) the intmt tits 
the model. The analytical, such as a context free 
grmnmar, are not appropriate for this problem as 
they require a broad intmt and are not suited to the 
fraglnents of" int)ut envisioned for this at)t)lications. 
The 1)rime purpose of the  model we wish 
to use is to t)rovide some ranking of candidate con- 
timmtions of a particular set of one or more tokens. 
A simple examI)le is the bigrmn model. This uses 
fl'equency counts of pairs of words derived froln a 
corlms. Although there are advantages and disad- 
vantages to this model, it will serve as an exmnI)le 
though other more Sol)histicated and reliable models 
inay easily be at)i)lied. 
5In Figure 4: Orientation Variation, the. column of text 
on the left of the tattle is a vertically orientated label (%W~m 
nlgeTk2L) whereas the remainder of the table is horizontally 
orientated. The apparent cohnnn oll the right of the tal)le is 
an artifact of the spacing and has no linguistic cohesion. 
335 
3 Basic Algorithm 
The problem can be generally described in the fol- 
lowing manner: given a set of objects distributed 
in a two dimensional grid, for each pair of objects, 
determine if they belong to tile same cohesive set. 
TILe objects are tokens, or words, and the measm'e 
of cohesion is that one word follows from the other 
in accordance with the the nature of the , 
the content of the document, and tlm idiom of the 
particular document element within which they may 
be contained and that the spatial model of the lay- 
out of the docmnent permits cohesion. In summary, 
the cohesion is spatial and linguistic. 
However, such a general description is not com- 
putationally sensible and the search space will be 
reduced if we consider the cases where we expect 
ambiguities to occur. This can be approached by 
recognising that when there is the potential for am- 
biguity there is often present some artifact, which 
tory well help identify the domain of the ambiguity: 
these are generally the markers of spatial cohesion; 
e.g., where there arc double cohnnns, we may also 
identify left justification. Consequently, for a given 
word in tile the double column area, tim mnbiguity 
may be resolved by inspecting tile word to the right, 
or tile set of words which may be left justified with 
the line currently under inspection on the line below. 
Therefore, tile application of tile  model to 
tile disambiguation problems mentioned above takes 
place between a small set of candidate continuation 
positions. 
These continuation points are located as pre- 
scribed by the markers of the spatial layout of text. 
Consequently, any algorithm using linguistic knowl- 
edge must exploit layout knowledge in order to 1)oth 
arrive at an economic sohltion, and also to be ro- 
bust to weaknesses in tile  model. The gen- 
eral method described here relies on and determines 
both spatial and linguistic information in a tightly 
integrated manner. Tile algorithm falls ill to the 
following broad steps: 
1. detect potential for ambiguity. 
2. compute the set of possible continuation points 
by using knowledge of spatial layout. 
3. disambiguate using a combination of hmguagc 
and layout knowledge. 
For examtfle, the words marked with a clear box 
in Figure 2, upper, are those which, according to a 
naive spatial algorithm, m'e possibly in close prox- 
imity to tile right edge of a text block. Hav- 
ing detected them, tile possit)le continuation points, 
shaded boxes, are comlmted (here for a single word 
for illustration). A  model may then be ap- 
plied to determine the most likely contimmtion. 
Care must be taken wlmn discovering equally 
likely continuations as opt)osed to a single most 
likely (me. Figure 2, lower, contains two examples. 
Tile first illustrates tile case when there is 11o con- 
tinuation appropriate (there are three equally likely 
continuations; as none is the most likely, no contin- 
nation should be proposed). In the second example, 
a unique continuation is preferred. The general al- 
gorithln above provides ammtation to the tokens in 
tile document which may then be used to drive a 
text-block recognition algorithm. 
Detecting the Potential for Ambiguity The 
potential for ambiguity occurs when a feature of the 
document is discovered which may indicate the im- 
mediate boundary of a text block. As we arc dealing 
with the basic element of a token (or word), the po- 
tential for ambiguity may occur at the end of a word, 
or between any two words in a sequence on tile line. 
However, we only need to consider those cases where 
a spatial algorithnl may determine a block boundary 
(correctly or incorrectly). In order to do this we need 
a characterisation of a spatial algorithln in terms of 
the features it uses to determine text block bound- 
aries.These are naturally related to space in the text, 
and so onr algorithm will be concerned with the fol- 
lowing three types of space: 1) Between words where 
there is a vertical river of white space which contin- 
ues above and below according to some threshold; 
2) Between words larger than a nfinimum amount 
of space; 3) At; tim right hand side of the document 
when no more tokens are found. These describe po- 
tential points for line break insertion into text and 
constitute a partial fllnctionat model of layout. 
Computing the Set; of Continuation Points 
The set of continuation points is comtmted accord- 
ing to the assumptions used to deterinine if there is 
the potential for ambiguity. The continuation point 
from a point of potential ambiguity are: 1) The next 
word to tile right; 2) The first word on tile next line; 
3) All tile continuation points on the next line which 
are to the left of the current word. These represent 
the complement to the above functional model of 
1wout. Thus we have a model of 1wout which is 
intentionally over general as it uses local features 
which are ambiguous. 
Disambiguation Disambiguation may be carried 
out in a number of ways depending on the extent re- 
quired by the  model being employed. How- 
ever, regardless of what range of history or looka- 
head is required by tile  model, the process 
of dismnbiguation is not a simple matter of selecting 
tile best possible continuation as proposed by the 
statistical or other elements of the  model. 
The interactions between layout and  re- 
quire that a nmnber of constraints be considered. 
These constraints model tile ambiguities cruised by 
336 
the layout and the . 
For any 1)otential point of anlbiguity, a single (or 
null) l)oint of continuation must be found. And for 
any l)oint of continuation, a single source of its his- 
tory is required. If token A has potential continua- 
tion points X and Y, and token B has potential con- 
tinuation points Y and Z, mid the best; continuation 
as predicted by the model for A is X and that tor B 
is also X, then both A mid B Call not be succeeded 
by their respective best continuations. The selection 
of continuation points nmst 1)e l)ased on the set of 
possible continuation points for the connected graph 
in which a potential point of ambiguity occurs (see 
Figure 3). An additional constraint inlposed by the 
1wout of the text is that links representing contin- 
uation cmmot cross. This constraint is a feature of 
the interaction between tile spatial layotlt all(1 the 
linguistic model. 
3.1 Extensions 
The above algorittnn is not callable, of capturing 
all types of continuation observed in the basic text; 
blocks of certain document elements. Specifically, 
there is an imi)licit restriction on a uni(lue continua- 
tion of the  through certain layout features. 
This may be called the one to one model of the inter- 
action t)etween layout and . I\]owever, the 
less fre(luent~ though equally inlt)ortant (:ases of one 
to many and many to one intera(:tions must also l)e 
considered. In Figure d: Many to One, exanll)les of 
1)oth are given. Significantly, these cases exists at 
the boundaries between t)asi(: textual COml)onents of 
large (loculnent ol).ie(:ts (here tables). It is suggested, 
the, n, that the detection of equally likely contimla- 
tion 1)oints may be used to dete, ct boml(larie, s where 
there is little or no sl)atial separation. (; 
3.2 Exi)erimentation 
Ill order to test the lmsic ideas described in this pa- 
1)er, a siml)le systenl was imt)lenlented. A (:orplts 
of documents was collected from the SEC archive 
(www.sec.gov). These docmnents are rich ill va.r- 
ious docunlent elenmnts inchlding tables, lists and 
headers. The documents are essentially flat, though 
there is solne anlount of header information encoded 
in XML as well as a nfinimal anlount of nmrkul) in 
the document; body. 
A simple 1)igram nlodel of the,  used was 
created. This was (:onstructed 1)artly from general 
texts (a corlms of English literature.) of which it was 
assumed there was no complex content, and 1)artly 
from tile SEC docunmnts. 7 A system was iml)le- 
°This begs a definition of equally likely - which would be, 
dependent on the  model and implementation. 
7An import;ant i)rocess in the creation of a  model 
for 1wout problems is the identification of usable  in 
the COl'pll8. ~\]~o these ends, the SEC (loculne.nl, s were marke(l 
up by hand to identiI~, i)aragrai)h text. These, text blocks 
mented which marked the potential points of mnbi- 
gully and tile continuation points and then at)plied 
the chlster and selection algorithln to determine the 
presence of spatio-linguistically cohesive text blocks 
(see example output ill Figure 1). 
As yet, no formal ewfluation of the implementa- 
tion is available. It can be asserted, however, that 
tile results obtained fl'om this preliminary implemen- 
tation indicate that the general method produces 
significant results, and that the basic notion of com- 
bining spatial and linguistic infornmtion tbr the de- 
ternfination of cohesive elements in a conlplex doe- 
unlent is a powerful one. 
Another experiment investigated tlle utility of the 
mettlods described in this paper. We wanted to de- 
termine how often mnbiguities occurred and how inl- 
1)ortant correct resolution was. Looking at; the am- 
biguity in table stub (:ells - tile mnlfiguity between 
multi-row ceils and multiple ceils below a header - 
resulted ill some significant results. For a sample of 
28 tables (1704 ceils); ill tile 131 stub cells we found 
68 examl)les of multi-row cells, and 35 of headers 
to multiple cells (note tlmt these are not disjoint 
sets). Using the SEC bigram model, the cases were 
disanll)iguated l)y hand, resulting in a 74 % success 
rate,. This sinlple investigation demonstrates that 
tim disalnbiguation is required and that linguistic 
inforination cm~ 1)e applied successflfily. 
4 Conclusions 
This l)aper has outlined a set of problents 1)articu- 
lar to the encoding of complex docmneng eh',ments 
in tlat or partially marked up files. The at)l)lit:a- 
tion of ~ siml)h',  nlodel in conjunction witll 
algorithms sensitive to the layout chal'acteristics of 
the docuulent elenlents ill terms of spatial ti;atures 
is in'oposed as a general solution to these problems. 
The, method relies on the, persistence of the  
ill which the document is written in tel'ms of the 
ulodel used to recognize it. 
ill the flltul'e, we intend to al)ply this approach to 
the implementation of a general layout analysis pre- 
processor. An interesting Dature of the interaction 
between tile  model and the 1wout of the 
document ix that the 1)erformance of a syst, enl ix (lilly 
sensitive to the quality of the  model at tile 
I)oints at wtfich it interacts with tile layout of tile 
docunlent. Consequently, a gelmral imrl)ose model 
built fronl a corpus of marked Ill) docmnents may 
be used to deternline a subset of the cohesive text- 
blocks ill a document. Those blocks may then be 
used to derive more  data, possil)ly specific 
to the documellt, and then tim process repeated until 
no nlore interactions are left ambiguous. 
were then used for the creation of a simple bigram model. 
337 
338 
I~eh~ic~l per sonne IEI 
~lli~, zener~l andE\] 
J~ Ho~th~ ended 
................ h 
\[ELI gOL\] \[Z<L\] CdEb 
\[55\] \[E~ EL\] ,k~ 
f'igure 1: Example portion of ()utt)ut from prototype system 
For example a paracNrapli occur. Applying ithis7 
of text is gram(ng-@i-dgI 'observettion to the\] 
wherever the line \[br6~aks} segmetltation of a double 
collm~n of text will indicaEe 
., - where the \].ine b~eaks 6ccur:.' 
Foriexample, a paragraph occur. Applying this 
of t'ext is \[gramrfiatical:. hlS~-r~ft:'~26i~ to the 
the line breaks seglnentation of a double 
column of text will indicate 
where the line breaks occur. 
!s0mdtime£ sentences may conspire to form ifais~ 
!positive s of rivers of white spaceiwl~ic~ 
appear itO separate blocks!. 
Sometimes.sentences may conspire to forra false 
ibbsitiV6s\]~ rivers of white space which 
a~_~___~ to separate blocks. 
b~umber ~6f Date \[Of 
\[13og~j \[-(~j ~I~_~ Name ~ Address 
t,'ignre 2: goc~tin4g Pote.ntial Ambiguity tlnd Computing Contiint~,tion Points 
If a higr~un model is use.d, the probal)ility that word ~,J is followed by word w' mary 
be expressed as; a probability as p(w' I w) and assigned a value between 0 and 1. If 
the probabilities are those shown in to the right then the continuation for A would 
be X and the contimtation point for B would be Y. 
p(XlA)=0.8 p(XlB)=0.5 
A B 
p(YIA)=0.4 p(YIB)=0.3 
Figure 3: Sorting continuation dei)end,; oil the potential layout of tile document 
339 
1 
p~ 
0 
0 
0 
0 ! 
0 o 
..n 
o 
Zi 
m 
@ 
o 
r~ 
o 
o~o~ ~o~ 
.~'~ 
Ng~g~ 
oz" 
0 
0 
om oo 
oe~ 
0 o 
~ o 
o 
~.~ o 
.o 
L~ 
0' 
c~ 
4~ 
o 
0 
o 
• ~.~ ,~ 
u • 
® .~ .~ 
~o.~ 
o 4~ 
0 0 0 
~o 
Y. 
~ H 
o ~),~ 
o 
o ~ .~ 
o 
o ~ ~ 
~.~ 
0 
~.~ ~ 
. 
0 
r~ 
,.~ 
o 
o o 
4~ 
oN 
,g 
~ i ~ ~ 
u ~ u u 
i , l 
0 
¢.) 
bO 
b.O 
p 
.# 
m b.o 
340 

References 

T. Kieninger and Andreas Dengel. 1998. A paper-to- 
html table converting system. In Proceedings of Doc- 
ument Analysis Systems (DAS) 98, Nagano, Japan, 
November. 

Jane Morris and Graeme Hirst. 1991. Lexical cohesion 
computed by thesaural relations as an indicator of the 
structure of text. Computational Li.quistics. 

Hwee Tou Ng, Chung Yong Lira, and Jessica Li Teng 
Koo. 1999. Learning to recognize tables in fi'ee text. 
In Proceedings of the 37th Annual Meeting oJ' the Asso- 
ciation for Computational Linguistics, pages 443-450, 
Maryland, USA, June. 

Daniela R~us ai~d Kristen Summers. 1994. Using white 
space for automated document structuring. Technical 
Report TR94-1452, Cornell University, Del)artmeilt of 
Computer Science, July. 
