Layout & Language: Preliminary experiments in assigning logical 
structure to table cells 
Matthew Hurst and Shona Douglas 
Language Technology Group, Human Communication Research Centre, 
University of Edinburgh, Edinburgh EH8 9LW UK 
{Matthew. Hurst, S. Douglas}~edinburgh. ac. uk 
Abstract 
We describe a prototype system for as- 
signing table cells to their proper place 
in the table's logical (relational) structure, 
based on a simple model of table struc- 
ture combined with a number of measures 
of cohesion between cell contents. Pre- 
liminary results suggest that very simple 
string-based cohesion measures are not suf- 
ficient for the extraction of relational in- 
formation, and that future work should 
pursue the aim of more knowledge/data- 
intensive approximations to a notional sub- 
type/supertype definition of the relation- 
ships between value and label cells. 
1 Introduction 
Real technical documents are full of text in tabu- 
lar and other complex layout formats. Most repre- 
sentations of tabular data are layout or geometry- 
based: in SGML, in particular, Marcy Thompson 
notes "table markup contains a great deal of infor- 
mation about what a table looks like.., but very lit- 
tle about how the table relates the entries .... \[This\] 
prevents me from doing automated context-based 
data retrieval or extraction." 1 
1.1 Views of tables 
In (Douglas, Hurst, and Quinn, 1995) an analysis 
of table layout and linguistic characteristics was of- 
fered which emphasised the potential importance of 
linguistic information about the contents of cells to 
the task of assigning a layout-oriented table repre- 
sentation to the logical relational structure it em- 
bodies. Two views of tables were distinguished: a 
denotational and a functional view. 
a(Thompson, 1996), p151. 
The denotation is the table viewed as a set of n- 
tuples, forming a relation between values drawn 
from n value-sets or domains. Domains typically 
consist of a set of values with a common supertype in 
some actual or notional Knowledge Representation 
scheme. The actual table may also include label 
cells which typically can be interpreted as a lexical- 
isation of the common supertype. We hypothesize 
that the contents of value cells and corresponding 
label cells for a given domain are significantly re- 
lated in respect of some measures of cohesion that 
we can identify. 
The functional view is a description of how the 
information presentation aspects of tables embody a 
decision structure (Wright, 1982) or reading path, 
which determines the order in which domains are 
accessed in building or looking up a tuple. 
To express a given table denotation according to 
a given functional view, there is a repertoire of lay- 
out patterns that express how domains can be 
grouped and ordered for reading in two dimensions. 
These layout patterns constitute a syntax of table 
structure, defining the basic geometric configura- 
tions that domain values and labels can appear in. 
1.2 An information extraction task 
Our application task is shallow information extrac- 
tion in construction industry specification docu- 
ments, containing many tables, which come to us 
via the miracles of OCR as formatted ASCII, e.g., in 
Figure 1. 
The predominant argument type of this genre of 
specification documents can be thought of as a form 
of 'assignment', similar to that in programming lan- 
guages. Our aim is to fit each assignment into a 
frame that contains various elements represented 
in terms of the sublanguage world model, a simple 
part-of/type-of knowledge representation. 
The elements we are looking for are entities, at- 
tributes which the KR accepts as appropriate for 
217 
I Maximuz total 
Mix I chloride ion content 
I (I\[ by height of cae~nt, 
J includin~ GGBS) 
I ................................................................ 
Prestre$1led concrete O. 1 
Concrlte made vith sulphate 
resleting Portland cement 0.2 
or supersulphated cent 
Concrete aada vith portland cement, 
Portland blastfurnaca cessna or 
coabination$ of GGB$ or PFA vith 0.4 
ordinary Portland cement and 
cont&Ini~ elbodded metal 
Figure 1: Example from the application domain 
t). w and d are either integers or the wild card ?, 
and specify respectively the x-extent and y-extent 
of an area of cells that can match the template; the 
wild card matches any width, or depth, as appropri- 
ate. t specifies whether the (sub)template is to be 
counted as a value (tv) or a label area (tl). 
A selection from a set of four possible restric- 
tions on a template can be defined: 
RESTRICTION 
-top 
-left 
+right 
+bottom 
AREA MUST 
not contain top row 
not contain leftmost column 
contain rightmost column 
contain bottom row 
The following templates are used currently: 
those entities, a unit or type for each attribute, a 
value which the assignment gives to each attribute, 
and a relationship expressing the semantic content 
of the assignment. To extract these components, we 
would like to have a basic representation of the tuple 
structure of the table, plus information about any la- 
bels and how they relate to the values, in order to 
specify fully the relationship and its arguments. 
1.3 Aims of the current work 
Without some way of identifying domains we cannot 
extract the table relation we require. Our aim is to 
investigate the usefulness of a range of cohesion mea- 
sures, from knowledge-independent to knowledge- 
intensive, in allowing us to select, from among those 
areas of table cells which are syntactically capable of 
being domains, those which in fact form the domains 
of the table. This paper reports the very beginning 
of the process. 
2 The current prototype system 
The system operates in two phases. In the first, a set 
of areas that might constitute domains is identified, 
using the constraints of table structure (geometric 
configuration) and cell cohesion. In the second, this 
candidate set is filtered to produce a consistent tiling 
over the table. 
2.1 A simplified table structure model 
The potential geometric configurations that we allow 
for a set of domain values (plus optional label) are 
called templates. A notation for specifying simple 
domain templates is defined as follows. 
A template is delimited by a pair of brackets 
\[... \]. Within the brackets is a list of sub-templates, 
currently recursive only to depth 1 and taken to be 
stacked vertically in the physical table. If a template 
has no sub-templates, it consists of a triple (ww, dd, 
lc: \[\[wl dl tl\] \[wl d? tv\]\] A label above a sin- 
gle column of values, of any height. 
lr: \[ \[wl dl tl\] \[w? dl tv\] \] A label above a single 
row of values, of any width. 
v: \[w? d? tv\]{-top -left +right +bottom} A 
rectangular area consisting of only values, re- 
stricted to domains at the bottom right margin, 
typically accessed using both x and y keys. 
c: \[wl d? tv\] A single column of values. 
2.2 A simplified cohesion model 
The 'goodness' of a rectangular area of the table, 
viewed as a possible instantiation of a given tem- 
plate, is given by its score on the various cohesion 
attributes. Values assigned for each of the chosen at- 
tributes are combined in a weighted sum to yield two 
overall cohesion scores for each MatchedArea, the 
value-value cohesion (v-v) and the label-value 
cohesion (l-v) as follows. 
We have a set of n v-v cohesion functions 
{f~r-v, fv-v.., fv-v} which each take two cells and 
return a value between 0 and 1 which reflects how 
similar the two cells are on that function, and a 
set of n weights #~ v-v o v-v w v-v} which deter- lwO , Wl • . . 
mine the relative importance of each function's re- 
sult. Then for any area A composed of a set of cells 
we can calculate a measure of the area's cohesion as 
a set of domain values: 
VS= ~ v-vScore(ci,cj) 
(ci,ci)EA 
(where (c~, cj) is an ordered pair of cells) 
n n 
v-vScore = / 2.-,X--"  ,v-v 
i=0 i=0 
218 
We have a set of m 1-v cohesion functions {flo-V, fl-v...fl'v } 
which each take two cells and 
return a value between 0 and 1 which reflects how 
likely one of the cells is to be a label for the other, 
- 1-v 1-v ..w~V} which and aset ofm weights ,tWo ,w 1 . 
determine the relative importance of each function's 
result. Then for an area A composed of a set of 
cells and a label cell ct we calculate a measure of the 
area's cohesion as a label plus set of domain values: 
LS = ~ 1-vScore 
(c~,cv):cvEA 
m m 
1-vScore = ~_wil-vA-v,li /Z..w ix'" l-v 
i----0 i=0 
A final score for the area is calculated as follows, 
depending on the type of template: 
If the area's template contains values and a label: 
finalScore = Wv-vVS + Wl_vLS Wv-v + Wl_ v 
where Wv-v and Wl_ v are weights reflecting the rela- 
tive importance given to the VS and LS respectively. 
If the area's template contains only values: 
finalScore = VS 
3 Experiments 
To test our system, we created a corpus of tables 
marked up in SGML with basic cell boundaries, al- 
lowing the template mechanism to determine the x 
and y position of cells. This representation is similar 
in relevant information content to many SGML table 
DTDS, and is also a plausible output from completely 
domain-independent techniques for table recognition 
in ASCII text or images, e.g., (Green and Krish- 
namoorthy, 1995). To this basic representation we 
added human-judgment information about the do- 
mains in each table (using an interface written in 
XEmacs lisp), specifying cell areas of values and la- 
bels for each domain. 
The tables were taken from a corpus of format- 
ted ASCII documents in the domain of construction 
industry specifications. 29 tables consisting of 91 
domains were open to examination during the ex- 
perimental development; 4 tables consisting of 13 
domains were held back as a test set. 
The tests we ran had different combinations of the 
cohesion measures alphanum and string-length 
with a factor ignorelabel, which corresponds to re- 
ducing the weighting wl-. for the goodness of the 
label match to 0. The unseen condition is the last 
(best-performing) combination, run on the held back 
data. 
The cohesion attributes reported here have values 
between 0 and 1, where 0 corresponds to high and 1 
to low similiarity: 
ALPHA-NUMERIC RATIO: Given by 
((laol- Ig~l labl - ~lgbl 
I'~,,1 + INa\[ \[Orb\[ + IlVbl )/L)"~ + O.5 
where laal is the number of alphabetic characters in 
string a and INal is the number of numeric characters 
in string a. 
STRING LENGTH RATIO: A nondirectional com- 
parison of string length. 
2.3 Selecting a set of MatchedAreas 
Given a set of templates, we find a set of MatchedAr- 
eas, rectangular areas of cells which satisfy a tem- 
plate definition and which reach a given cohesion 
threshold. The set of MatchedAreas contains no ar- 
eas that are wholly contained in other matched areas 
for the same template. 
From the set of MatchedAreas we select the ar- 
eas we believe to be the domains of the table using 
a greedy algorithm which selects a set of cells that 
form a complete, non-overlapping tiling over the ta- 
ble. 
4 Results and future work 
The recall results are given in Table 1. The experi- 
ment column specifies the trial in terms of the fac- 
tors defined above. The templates columns specify 
which templates are included in the trial. The re- 
call score for each trial is the number of matched 
areas that perfectly agree with the boundary and 
type of a domain as marked by the human judge, 
as a percentage of the number of domains identified 
by the human judge. (Since the selection algorithm 
produces only a single tiling for each table, precision 
was not explicitly measured.) 
4.1 Effect of templates 
Increasing the number of templates available at one 
time reduces the recall performance because of con- 
fusion during the selection process; if we used only 
the lc template, for instance, we would get better 
performance overall per domain (in this application 
area). The true performance of the system has to be 
judged with respect to the complete set (the right- 
most column in the results table), however, since all 
these templates are needed to match even quite sim- 
ple tables. 
219 
Experiment 
Templates available (lc~ (lr} iv} (c} (lc, (lc, (lc, (lr, ~lr, 
lr} ,} ~} v} c} 
alphanum 84 3 3 3 82 32 60 5 2 
stringlength 41 1 0 0 42 30 35 1 1 
alphanum, 
ignorelabel 84 3 3 3 84 34 84 5 2 
stringlength, 
ignorelabel 41 1 0 0 42 34 41 1 1 
alphanum, 
stringlength 75 2 3 3 75 45 68 4 1 
alphaaum, 
stringlength, 
ignorelabel 75 2 3 3 76 47 75 5 2 
unseen 77 8 0 0 77 62 62 8 8 
{v, (lc, (lc, (lc, (It, (lc, 
e} lr, lr, v, v, lr, c} c} c} 
c} 
7 33 59 24 9 26 
0 31 36 26 1 27 
7 36 84 34 9 36 
0 35 42 34 1 35 
7 45 67 42 8 42 
7 48 
0 62 
76 47 8 48 
62 54 0 54 
Table 1: Recall results for all experimental conditions: % of actual domains correctly identified 
The simple templates used here are also not ade- 
quate for more complex tables with patterns of reca- 
pitulation and multiply layered spanning labels. We 
intend to take a more sophisticated view of possi- 
ble geometric configurations, perhaps similar to the 
treatment in (Wang, 1996), and use the idea of read- 
ing paths to extract the tuples by relating the ap- 
propriate values from different domains. 
4.2 Effect of cohesion measures 
The alphanum and stringlength measures in combi- 
nation do perform rather better than alone. How- 
ever, ignoring l-v cohesion always improves recall; 
these cohesion measures do not help in distinguish- 
ing between labels and values, or in linking labels 
with value-sets. 
This will be more of a problem when we deal 
with more complex tables with complex multi-cell 
labels. In future, we intend to investigate the ef- 
fect of more sophisticated cohesion measures, includ- 
ing the use of thesaural information from domain- 
independent sources and corpus-based Knowlege Ac- 
quisition, e.g., (Mikheev and Finch, 1995), which 
should form better approximations to the super- 
type/subtype distinction. 
Combining a number of measures, in the kind 
of framework we have presented here, should allow 
graceful performance over a wide range of domains 
using as much information as is available, from what- 
ever source, as well as convenient evaluation of the 
relative contribution of different sources. 
Acknowledgements 
We acknowledge the support of BICC plc who sup- 
plied data and funded the first author during most 
220 
of this work, and of the Engineering and Phys- 
ical Sciences Research Council of the UK, who 
funded the second author under the CISA U project 
(IED4/1/5SlS). 

References 

Douglas, Shona, Matthew Hurst, and David Quinn. 
1995. Using natural language processing for identifying and interpreting tables in plain text. In 
Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 535-545. 

Green, E. and M. Krishnamoorthy. 1995. Recognition of tables using tables grammars. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 
261-278, University of Nevada, Las Vegas, April. 

Mikheev, A. and S. Finch. 1995. A workbench for 
acquisition of ontological knowledge from natural 
text. In Proceedings of the 7th conference of the 
European Chapter for Computational Linguistics, 
pages 194-201, Dublin, Ireland. 

Thompson, Marcy. 1996. A tables manifesto. In 
Proceedings of SGML Europe, pages 151 - 153, 
Munich, Germany. 

Wang, Xinxin. 1996. Tabular Abstraction, Editing, 
and Formatting. Phd, University of Waterloo, On- 
tario, Canada. 

Wright, Patricia. 1982. A user-oriented approach to 
the design of tables and flowcharts. In David H. 
Jonassen, editor, The Technology of Text. Educa- 
tional Technology Publications, pages 317-340. 
