Mining Tables from Large Scale HTML Texts 
Hsin-Hsi Chen, Shih-Chung Tsai and Jin-He Tsai 
Department o1' Computer Science and hfformation Engiueering 
National Taiwan University 
Taipei, TAIWAN, R.O.C. 
E-mail: hh_chen @csie.ntu.edu.tw 
Abstract 
Table is a very common presentation scheme, 
but few papers touch on table extraction in text 
data mining. This paper l'ocuscs on mining 
tables from large-scale HTML texts. Table 
filtering, recognition, interpretation, and 
presentation arc discussed. Heuristic rules and 
cell similarities arc employed to identify tables. 
The F-measure ot' table recognition is 86.50%. 
We also propose an algorithm to capture 
attribute-value relationships alnong table cells. 
Finally, more structured data is extracted and 
presented. 
Introduction 
Tables, which arc simple and easy to use, 
are very common presentation sclleme for 
writers to describe schedules, organize statistical 
data, summarize cxpcrilnental results, and so on, 
in texts ol' different domains. Because tables 
provide rich inlbrmation, table acquisition is 
useful for many applications such as document 
tmderstauding, question-and-answering, text 
retrieval, etc. However, most of previous 
approaches on text data mining focus on text 
parts, and only few touch on tabular ones 
(Appelt and Israel, 1997; Gaizauskas and Wilks, 
1998; Hurst, 1999a). Of the papers on table 
extractions (Douglas, Hurst and Quinn, 1995; 
Douglas and Hurst 1996; Hurst and Douglas, 
1997; Ng, Lim and Koo, 1999), plain texts arc 
their targets. 
I11 plain text, writers often use special 
symbols, e.g., tabs, blanks, dashes, etc., to inake 
tables. The following shows an example. It 
depicts book titles, authors, and prices. 
title author price 
Statistical Language Learning E.Chamiak $30 
Cross-Language Inforlnation P.elrieval G. Grefenstette $115 
NaturalLanguage Information Retrieval T.Slrzalkowski $144 
When detecting il' there is a table in free text, we 
should disambiguatc tile uses of tile special 
symbols. That is, the special symbol may be a 
separator or content o1' cells. Previous papers 
employ grammars (Green and Krishuainoorthy, 
1995), string-based cohesion measures (Hurst 
and Douglas, 1997), and learning methods (Ng, 
Lim and Koo, 1999) to deal with table 
recognition. 
Because of the silnplicity of table 
construction lnethods in free text, the expressive 
capability is limited. Comparatively, the 
markup s like HTML provide very 
flexible constructs for writers to design tables. 
The flexibility also shows that table extraction in 
HTML texts is harder than that iu plain text. 
Because the HTML texts are huge on the web, 
and they arc important sources o1' knowledge, it 
is indispensable to deal with table mining on 
HTML texts. Hurst (1999b) is the first attempt 
to collect a corpus froln HTML files, LAT~X 
files and a small number o1' ASCII files for table 
extraction. This paper focuses on HTML texts. 
We will discuss not only how to recognize tables 
from HTML texts, but also how to identify the 
roles of each cell (attribute and/or value), aud 
how to utilize the extracted tables. 
1 Tables in HTML 
HTML table begins with au optional 
caption t'ollowcd one or more rows. Each row 
is formed by one or more cells, which are 
classified into header and data cells. Cells 
can be merged across rows and colulnns. The 
following tags arc used: 
(1) <table...> </table> 
(2) <tr ...> </tr> 
(3) <td...> </td> 
(4) <th ...> </th> 
(5) <caption ...> </caption> 
166 
Table 1. All Example for a Tour Package ~ 
................ T~;t,r i~o'iic ................... } ......... isi'gi;)XR()iAii .......... } 
' ........... ;diiii~i ............... i 1999iii;fiOJ-2iJO0103131 i 
i ...... Ci,,,L0i.~x/o;\];\],,;i ............ { i,;ccgn\[;/,ii c (!izls Y ll.;x{c;isii;ii i 
il i, Siligiei(i;t;ii; i, 35,450 I 2510 ' 
i Adtilt i11 l)oublc Room .... :3:2;.5(J(J I i2)3i) I 
i II. (iixi;:{\[iGi i ....... 3i/556 ......... -7}6 i ........ >\[ _ < , I 
' !d Occupatioll i 25800 i i430 i 
Child ii~l ExU'aBed i 23,850 i '7\]0" i ....................... 
' i i t No Occt )a oi/I 22,900 360 i 
They denote main wrapper, table row, table data, 
table header, and caption for a table. Table 1 
shows an example that lists the prices for a tour. 
The interpretation of this table in terms of 
altribute-wdue relationships is shown as follows: 
Allribul¢ Vahtc 
Tour Code I)P91,AX01AI{ 
Valid 1999.04.01-2000.03.31 
Adult-l>ricc-Singlo Room-l~;conomic Class 35,450 
Adult-l'ricc-l)oublc l{oom-EconoMc Class 32,500 
Adult-l'ricc-Extra Ilcd-l:conomic Class 30,550 
Child-Pricc-OccutmtioiM :cononfic Class 25,800 
Child-t'rice-l';xlra Iled-l,;conomic Class 23,850 
Child-Price-No ()ccupalion-I.;conomic Class 22,900 
Adult-l'ricc-Single Room-l.:xlension 2,510 
Adtdt-Price-l)ouble l~oouM:,xlension 1,430 
Adtilt-lMcc-Fxtra Ilcd-Fxtcnsion 720 
Child l'ricc-()CCUl)ation-Fxicnsion 1,430 
Child-l'ricc-l';xh'a Bed-l';xtension 720 
Child Price-No ()ccupaiiolM,;xtcnsion 360 
Cell may play the role o1' attribute and/or value. 
Several cells may be concatenated to denote an 
altribute. For example, "AdulI-Price-Single 
\]),ecru-Economic Chlss" means the ;tdl.llt price 
for economic class and single room. The 
relationships may 13o read in column wise or in 
row wise depending on the interpretation. For 
example, the relationship for "Tour 
Code:I)P9LAXOIAB" is in row wise. The 
prices for "Economic Class" are in column wise. 
The table wrapper (<table> ... </table>) is 
a useful cue lkw table recognition. The H'FMI, 
text for the above example is shown as follows. 
The table tags are enclosed by a table wrapper. 
<lablc border> 
<if> 
<td COI~SIL,\N="3">Totu" Code</td> 
<ld COI,SI'AN="2">I)I'91,AXO1AB</Id> 
</tr> 
<11> 
<id COLS PAN="3">Valid</id> 
<ld C, OLS PAN="2"> 1999.04.01-2000.03.31 </td> 
</I r> 
<lr> 
' This example is selected from http://www.china- 
airlincs.com/cdl~ks/los7.-4.htm 
<td COI,Sl'AN:"3">Class/I.~xlensic, n</td> 
<td>l':cononlic Class</td> 
<td>l';xlcnsion</td> 
</ir> 
<I1> 
<td ROW SPA N="3">Adult</td> 
<ld ROW.q PA N="6"><I)> P</l)> 
<l)>P,</p> 
<p>l</p> 
<p>C</p> 
<p>l:</td> 
<ld>Single Rooni</td> 
<ld>35,450</Id> 
<1d>2,510</td> 
</tr> 
<tr> 
<ld>l)oubl¢ I~,oom</td> 
<ld>32,500</Id> 
<ld> 1,430</td> 
<ttr> 
<h> 
<td>l:;xlra Hcd</td> 
<td>30,550</kl> 
<td>720</td> 
</tr> 
<Jr> 
<td>Chikl</id> 
<td>()ccupation<</td> 
<td>25,800</td> 
<td> 1,430</td> 
<It r> 
<11> 
<td>l ';xtra Ik'd</tcl> 
<td>23,850</td> 
<td>720</id> 
<It r> 
<11> 
<td>No ()CCUlmtion</td> 
<td>22,900</td> 
<kl>360</td> 
</It> 
<:/lalq,_.> 
ltowever, ;l taMe does not always exist when 
table wrapper al3pears in ft'I'MI, text. This is 
because writers often employ table tags to 
i'cpresent forlll or IllOlltl. That allows users to 
input queries or make selections. 
Another fx)int that shoukt be mentioned is: 
table designers usually employ COLSPAN 
(ROWSPAN) to specify how many cohunns 
(rows) a table cell should span. In this example, 
the COI,SPAN of cell "Tour Code" is 3. That 
means "Tour Code" spans 3 columns. 
Similarly, the P, OWSI~AN o1' cell "Adult" is 3. 
This cell spans 3 rows. COLSPAN and 
ROWSPAN provide flexibility for users to 
design any kinds ot' tables, but they make 
automatic table interpretation more 
challengeable. 
167 
2 Flow of Table Mining 
The flow of table nfining is shown as 
Figure 1. It is composed of five modules. 
Hypertext processing module analyses HTML 
text, and extracts the table tags. Table filtering 
module filters out impossible cases by heuristic 
rules. The remaining candidates are sent to 
table recognition module for further analyses. 
The table interpretation module differentiates 
the roles of cells in the tables. The final 
module tackles how to present and employ the 
mining results. The first two modules are 
discussed in tile following paragraph, and the 
last three modules will be dealt with in the 
following sections in detail. 
table 
recognition 
It 
table 
interpretation 
hypertext 
plocessin~ 
It 
table 
filtering 
presentation 
of results 
Figure 1. Flow of Table Mining 
As specified above, table wrappers do not 
always introduce tables. Two filtering rules are 
employed to disambiguate their functions: 
(1) A table must contain at least two cells 
to represent attribute and value. In other words, 
the structure with only one cell is filtered out. 
(2) If the content enclosed by table 
wrappers contain too much hyperlinks, forms 
and figures, then it is not regarded as a table. 
To evaluate the performance of table 
mining, we prepare the test data selected from 
airline information in travelling category o1' 
Chinese Yahoo web site (http://www.yaboo.com. 
tw). Table 2 shows the statistics of our test 
data. 
AMines 
Number of 
Pages 
# of 
Wrappers 
Number of 
Tables 
Table 2. Statistics o1' Test Data 
China 
AMine 
694 
2075 
751 
Eva 
Airline 
366 
568 
98 
Mandarin Singapore Fareast Sum 
AMine AMine Ml'line 
142 110 60 1372 
184 163 228 3218 
(2.35) 
23 40 6 918 
(0.67) 
Table 3. Pertbrmance of Filtering Rules 
China Eva Mandarin Singapore Fareast Sum 
Airline Airline AMine Airline Airline 
#of 2075 568 184 163 228 3218 
wrappers 
Number of 751 98 23 40 6 918 
Tables 
Number of 1324 470 161 123 222 2300 
\[Non-Tables 
Total 973 455 158 78 213 1877 
Filter 
Wrong 15 0 0 3 2 20 
Filter 
Correct 98.46% 100% 100% 96.15% ~)9.06% )8.93% 
Rate 
These four rows list tile names of aMines, total 
number of web pages, total number of table 
wrappers, and total number of tables, 
respectively. On the average, there are 2.35 
table wrappers, and 0.67 tables for each web 
page. The statistics shows that table tags are 
used quite often in HTML text, and only 28.53% 
are actual tables. Table 3 shows the results 
after we employ the filtering rules on the test 
data. Tile 5 th row shows how many non-table 
candidates are filtered out by the proposed rules, 
and tile 6 th row shows the nulnbcr of wrong 
filters. On the average, the correct rate is 
98.93%. Total 423 of 2300 nou-tables are 
remained. 
3 Table Recognition 
After simple analyses specified in the 
previous sectiou, there are still 423 non-tables 
passing the filtering criteria. Now we consider 
the content of the cells. A cell is much shorter 
than a senteuce in plain text. In our study, the 
length of 43,591 cells (of 61,770 cells) is smaller 
than 10 characters 2. Because of the space 
lilnitation in a table, writers often use shorthand 
notations to describe their intention. For 
a A Chinese character is represented by two bytes. 
That is, a cell contains 5 Chinese characters oil the 
average. 
168 
example, they may use a Chinese character (":~,\]", 
dao4) to represent a two-character word "~j~" 
(dao4da2, arrive), and a character ("¢~", 1i2) to 
denote the Chinese word ",~$ i~,~l" (li2kail, leave). 
They even employ special symbols like • and 
Y to represent "increase" and "decrease". 
Thus it is hard to determine if a fragment of 
ttTML text is a table depending on a cell only. 
The context among cells is important. 
Value cells under the same attribute names 
demonstrate similar concepts. WE employ the 
following metrics to measure the cell similarity. 
(1) String similarity 
We measure how many characters are 
common in neighboring cells. I1' the 
lmmber is above zt threshold, we call 
lhe two cells are similar. 
(2) Named entity simihuily 
The metric considers semantics of cells. 
We adopt some named entity 
expressions defined in MUG (1998) 
such as date/time expressions and 
monetary and percentage expressions. 
A role-based lnethod similar {o lhe 
paper (Chert, Ding, and Tsai, 1998) is 
employed to tell if a cell is a specific 
named entity. The neighboring cells 
belonging to the same llalned entity 
category are similar. 
(3) Number category similarily 
Number characters (0-9) appear very 
often. If total number characters in a 
cell exceeds a threshold, we call tlae 
cell belongs to !.he number category. 
The neighboring cells in number 
category are similar. 
We count how many neighboring cells are 
similar. If the percentage is above a threshold, 
the table tags are interpreted as a table. The 
data after table filtering (Section 2) is used to 
evaluate the strategies in table recognition. 
'Fables 4-6 show the experimental results when 
the three metrics are applied incrementally. 
Precision rate (P), recall rate (R), and 
F-measure (F) defined below are adopted to 
measure the performance. 
p = NumberQ/Correct7?tl)lesSystemGenerated 
TotalNumberO/TahlesSystem Gen crated 
R = NumberOJ'CorrectTahlexSystemGenerated 
7btalNumberOfCorrectT"ables 
P+R 
2 
Table 4 shows that string similarity cannot 
capture the similar concept between neighboring 
cells very well. The F-measure is 55.50%. 
Table 5 tries to incorporate more semantic 
features, i.e., categories of named entity. 
Unlbrtunately, the result does not meet our 
expectation. The performance only increases a 
little. The major reason is that the keywords 
(pro/am, $, %, etc.) for date/time expressions 
and monetary and percentage expressions are 
usually omitted in {able description. Table 6 
shows that the F-measure achieves 86.50% 
when number category is used. Compared wilh 
Tables 4 and 5, the performance is improved 
Table 4. String Similarity 
China l';;'a 
Airline Airline 
Numhcr of 751 98 
Tables 
Tables 150 4 I 
Proposed 
Correct 134 39 
l'rccision 89.33% 95.12% 
Ralc 
Recall Ralc 17.8'l% 39.80% 
l:-mcasurc 53.57% 67.46% 
Mandarin Singapore l"areast Nttm 
AMine AMine Airline 
23 4O 6 918 
7 17 5 220 
7 14 3 197 
lOOq~ 82.35% 6(Y/, 89.55c/~ 
30.43% 35.00% 50% 21.46% 
65.22~A 58.68% 55% 55.50% 
Table 5. String or Named Entity Similarity 
China l';wL Mandarin Singapore Farcasl Sum 
Airline Airline AMine Airline Airline 
Number of 751 98 23 40 6 918 
Tables 
Tables 151 42 7 17 5 222 
Proposed 
Correct 135 40 7 14 3 199 
Precision 89.40% 95.24% 100% 82.35% 60% 89.64% 
Rate 
Recall Rate 17.98% 40.82% 30.43% 35.00% 50% 21.68% 
F-measure 153.69% 68.03% 65.22% 58.68% 55% 55.66% 
Table 6. String, Named Entity, 
or Nulnber Category Similarity 
China 10,,a Mandarin Singapore Fai'cast Stllll 
AMine AMine Airline AMine AMine 
Nmnbm" of 751 98 23 40 6 918 
Tables 
Tables 668 60 16 41 6 791 
l'roposcd 
Correct 627 58 14 32 4 735 
Precision 93.86% 96.67% 87.50% 78.05% 66.67% 92.92% 
Rate 
P, ccall P, alc 83.49% 59.18% 60.87% 80.00% 66.67% 80.07% 
F-measure 88.88% 77.93(/,, 74.19% 79.03% 66.67% 86.50% 
169 
drastically. 
4 Table Interpretation 
As specified in Section 1, the 
attribute-value relationship may be interpreted in 
colunm wise or in row wise. If the table tags in 
questions do not contain COLSPAN 
(ROWSPAN), the problem is easier. The first 
row and/or the first column consist of the 
attribute cells, and the others are value cells. 
Cell similarity guides us how to read a table. 
We define row (or column) similarity in terms of 
cell similarity as follows. Two rows (or 
columns) are similar il' most of the 
corresponding cells between these two rows (or 
columns) are similar. 
A basic table interpretation algorithm is 
shown below. Assume there are n rows and m 
Let % denote a cell in i m row and jth colulnns. 
col u mn. 
(1) I1' there is only one row or column, 
then the problem is trivial. We jnst 
read it in row wise or column wise. 
(2) Otherwise, we start the similarity 
checking froln the right-bottom 
position, i.e., c,~,n. That is, the n th 
row and the in th column arc regarded 
as base for comparisons. 
(3) For each row i (1 _< i < n), compute 
the similarity of the two rows i and n. 
(4) Count how many pairs of rows are 
similar. 
(5) If the count is larger than (n-2)/2, and 
the similarity of row 1 and row n is 
smaller than the similarity of the other 
row pairs, then we say this table can 
be read in column wise. In other 
words, the first row contains attribute 
cells. 
(6) The interpretation from row wise is 
done in the similar way. We start 
checking from in th coluInn, compare it 
with each column j (1 < j < in), and 
count how many pail's of columns are 
similar. 
(7) If neither "row-wise" nor 
"column-wise" can be assigned, then 
the default is set to "row wise". 
Table 6 is an example. The first column 
contains attribute ceils. The other cells arc 
statistics of an expel'imel~tal result. We read it 
in row wise. ff COLSPAN (ROWSPAN) is 
used, the table iutet'pretation is more difficult. 
Table 1 is a typical example. Five COLSPANs 
and two ROWSPANs are used to create a better 
layout. The attributes are formed 
hierarchically. The following is an example of 
hierarchy. 
Adult ..... I'rice ........... Double Room 
............. Single Room 
............ Extra Bed 
Here, we extend the above algorithm to 
deal with table interpretation with COLSPAN 
(ROWSPAN). At first, we drop COLSPAN 
and ROWSPAN by duplicating several copies o1' 
cells in their proper positions. For example, 
COLSPAN=3 for "Tour Code" in Table 1, thus 
we duplicate "Tour Code" at colunms 2 and 3. 
Table 7 shows the final reformulation el' the 
example in Table 1. Then we employ the 
above algorithln with slight inodification to l'ind 
the reading direction. 
The modification is that spanning cells ale 
boundaries for similarity checking. Take Table 
7 as an example. We start the similarity 
checking from the i'ight-I~ottom cell, i.e., 360, 
and consider each row and column within 
boundaries. The cell "1999.04.01- 2000.03.31" 
is a spanning cell, so that 2 "a row is a boundary. 
"Price" is a spanning cell, thus 2 '''1 column is a 
boundary. In this case, we can interpret the 
table tags in both row wise and column wise. 
Table 7. Reformulation of Example in Table 1 
\] our t Tot ' Co( e Code ~I'F°t "Co¢ e I+DP9LAX0 AB DP9LAX01ABi 
,. ....... i 1999.04.01- 1999.04.01- vmd { Vand 1 Vanu t 
' t ' ~ ' I 2000.03.31 2000.03.31 
......... "C5~7;') " " U~i~{g21++f++~Sii~i;~U+ l++Eco;;o;iiic++V2. ........ ~ ............. ! extenmon 
Extension {ExtensionlExtension t Class ! " ' 
.......................... : ........................................................................................................................... t " ~ \[ , Single Adult PIxICE I 35450 2,510 
f + ! ~'°°m ! ' 
Double Adult PRICE I-" 32 500 1,430 
I ~ I Room | " '" 
...... ++el +++ +V+i++~ + ++(+i ~i o++ ++`++++;i+++ + ........... +++ ........ 
.... +++ +i+++++ + ++a +<+++ ++ + ............ +++0 ......... 
+ + | 
Child | I'RICE + , . 22 900 360 
170 
After that, a second cycle begins. The 
starling points are moved lo new right-bottom 
positkms, i.e., (3, 5) and (9, 3). In this cycle, 
boundaries are reset. The cells 
I)P9LAX01AIF' and "Adtll\[" ("(~.hild") itlC 
spalltlil\]g coils, so that I st row alld is| colun\]n {ll*C 
|low bouildarios. At this \[illle, "to\v-\vise" i:; 
selected. 
In final cycle, Ihc starting positions are (2,5) 
and (9, 2). The boundaries arc 0 'l' rOW and ()u~ 
column. Those two siib-tables are road it\] row 
wise. 
5 Presentalion of Table Extraction 
The results of table interprctatioll arc a 
sequence of attributc-wfluc pairs. Consider the 
tour example. Table 8 shows the extracted 
pairs. We can find ihe following two 
phenomena: 
(I) A cell may be a vahle of lliOre \[h~tll ()tic 
attribute. 
(2) A cell may ael as an attribute in one 
case, and a value in another case. 
We can concatenate two attributes logelher by 
using phenomenon (1). l;or example, "35,450" 
is a value of "Single Room" and "Economic 
Class", thus "Single Room-Econonfic Class" is 
formed. Besides l\[lal, we Call find attribute 
hierarchy by using l)hcnomcnon (2). For 
example, "Single 1),oom" is a value o1" "Price", 
and "Price" is a vahie of "Adult", so that we can 
create a hierarchy "Adult-Price-Single Room". 
Merging the results from these two 
phononlena, we can create the in/erl~rclations 
that we listed in Section 1. For example, from 
the two facts: 
"35,450" is a wflue of "Single Room-L;conomic 
Class", and 
"Adult-Price-Single Room" is a hierarchical 
attribute, 
we can infer that 35,450 is a vahie o1' 
"Adult-Price-Single Rooin-Economic Class". 
In this way, we can transform unstructured 
data into more slrtictured representatioil for 
fttrther applications. Consider an application in 
quest|O|\] al\]d answering. Giver a query like 
"how much is the piice of a double |oom for all 
adult", the keywoMs are "price", "double 
Table 8 Tim Extracted Attril)ute-Value Pairs 
l ~' cycle 
2 '"1 t')'t'|e 
3 'a t'ydc 
Altribulc Value 
Single P, oonl 35¢150 
Single I{cx:,nl 2,510 
I )Otlblc l(ooin 
I )Otlble P, ooln 
32,500 
1,430 
No Occul)atioll 22,900 
No Occultation 360 
I'conomic Class 35,450 
Economic Class 32,500 
Ec~monfic Class 22,900 
I:xlcnsion 2,510 
I'xtension 1,430 
I-xtension 
Class/I,;xtension 
360 
Economic Class 
Class/l';xicnsion l:~xtension 
Valid 1999.04.01-2000.03.31 
Price Single Room 
Price Double ROOlll 
I'RICI:, No ()cctqmtion 
Tour Code I)l'9t ,AX01ANB 
((( Valid 199 ~.()4.01-2000.03.31 
Adul! Price 
Child Price 
room", and "adult". After consulting the 
database learning from HTMI. lexls, two wflues, 
32,500 and 1,430 with attributes economic class 
and extension, are reported. With this table 
mining technology, knowledge lhat can be 
employed is beyond text level. 
Conclusion 
in this paper, we propose a systematic way 
to mine tables from HTML texts. Table 
filtering, table recognition, table interpretation 
and application of table extraction are discussed. 
The cues l'ron\] HTML lags and information in 
taMe cells are employed to recognize and 
interpret tables. The F-measure for table 
171 
recognition is 86.50%. 
There are still other spaces to improve 
performance. The cues from context of tables 
and the traversal paths of HTML pages may be 
also useful. In the text surrounding tables, 
writers usually explain the meaning of tables. 
For example, which row (or column) denotes 
what kind ol' meanings. From the description, 
we can know which cell may be an attribute, and 
along the same row (column) we can find their 
value cells. Besides that, the text can also 
show the selnantics ot' the cells. For exalnple, 
the table cell may be a monetary expression that 
denotes the price of a tour package. In this 
way, even money marker is not present in the 
table cell, we can still know it is a monetary 
expression. 
Note that HTML texts can be chained 
through hyperlinks like "previous" and "next". 
The context can be expanded further. Their 
effects on table mining will be studied in the 
future. Besides the possible extensions, 
another research line that can be considered is to 
set up a corpus for evaluation o1' attribute-value 
relationship. Because the role of a cell 
(attribute or value) is relative to other cells, to 
develop answering keys is indispensable for 
table interpretation. 

References 

Appelt, D. and Israel, D. (1997) "Tutorial Notes on 
Building Information Extraction Syslems," Tutorial 
on Fifth Conference on Applied Natural Language 
Processing, 1997. 

Chen, H.H.; Ding Y.W.; and Tsai, S.C. (1998) 
"Named Entity Extraction for Information 
Retrieval," Computer Processing of Oriental 
Languages, Special Issue on Information Retrieval 
on Oriental Languages, Vol. 12, No. 1, 1998, 
pp.75-85. 

Douglas, S.; Hurst, M. and Qui,m, D. (1995) "Using 
Natural Language Processing for Identifying and 
Interpreting Tables in Plain Text," Proceedings of 
Fourth Annual Symposium on Document Analysis 
and Informatiotl Retrieval, 1995, pp. 535-545. 

Douglas, S. and Hurst, M. (1996) "Layout and 
Language: Lists and Tables in Technical 
Documents," Proceedings of ACL SIGPARSE 
Workshop on Punctuation in Computational 
Linguistics, 1996, pp. 19-24. 

Gaizauskas, R. and Wilks, Y. (1998) " Infornmtion 
Extraction: Beyond Document Retriew~l," 
Computational Linguistics and Chinese Language 
Processing, Vol. 3, No. 2, 1998, pp. 17-59. 

Green, E. and Krishnanloorthy, M. (1995) 
"Recognition of Tables Using Grammars," 
Proceedings of the Fourth Annual Symposium on 
Document Analysis and Information Retrieval, 
1995, pp. 261-278. 

Hurst, M. and Douglas, S. (1997) "Layout and 
Language: Preliminary Experiments in Assigning 
Logical Structure to Table Cells," Proceedings of 
the Fifth Conference on Applied Natural Language 
Processing, 1997, pp. 217-220. 

Hurst, M. (1999a) "Layout and Language: Beyond 
Simple Text for Information Interaction - Modeling 
the Table," Proceedings of the 2rid htternatiottal 
Conference on Multimodal hlterJ?tces, Hong Kong, 
January 1999. 

Hurst, M. (1999b) "Layout and Language: A Corpus 
ol' Documents Containing Tables," Proceedings of 
AAAI Fall Symposium: Usillg Layout for the 
Generation, Undelwtanding arm Retrieval oJ 
Documetttx, 1999. 

Mikheev, A. and Finch, S. (1995) "A Workbench for 
Acquisition of Ontological Knowledge from 
Natural Text," Proceedings of the 7th Conference 
of the European Chapter for Computational 
Linguistics, 1995, pp. 194-201. 

MUC (1998) Proceedittgs of 7 'h Message 
UndelwtatMing Conferetlce, hltp://www.muc.saic. 
corn/proccedings/proceedil~gs index.html. 

Ng, H.T.; Lira, C.Y. and Koo, J.L.T. (1999) 
"Learning Io Recognize Tables in Free Text," 
Proceedings of the 37th Annual Meeting of ACL, 
1999, pp. 443-450. 
