Integration Of Visual Inter-word Constraints And 
Linguistic Knowledge In Degraded Text Recognition 
Tao Hong 
Center of Excellence for Document Analysis and Recognition 
Department of Computer Science 
State University of New York at Buffalo, Buffalo, NY 14260 
t aohong@cs, buffalo, edu 
Abstract 1 2 3 4 
Degraded text recognition is a difficult task. Given a Please fin in tire 
noisy text image, a word recognizer can be applied to 0.90 0.33 0.30 0.80 Fleece fill In toe 
generate several candidates for each word image. High- o. o5 o. 30 o. 28 o. io 
level knowledge sources can then be used to select a Pierce flu io lire 
decision from the candidate set for each word image. 0.02 0.21 0.25 0.05 
In this paper, we propose that visual inter-word con- Fierce flit ill the 
straints can be used to facilitate candidate selection. 0.02 o. 10 o. 13 0.03 
Visual inter-word constraints provide a way to link word Pieces till Io Ike 
images inside the text page, and to interpret them sys- 0.01 0.o6 0.04 0.02 
tematically. 
Introduction 
The objective of visual text recognition is to transform 
an arbitrary image of text into its symbolic equivalent 
correctly. Recent technical advances in the area of doc- 
ument recognition have made automatic text recogni- 
tion a viable alternative to manual key entry. Given a 
high quality text page, a commercial document recog- 
nition system can recognize the words on the page at 
a high correct rate. However, given a degraded text 
page, such as a multiple-generation photocopy or fac- 
simile, performance usually drops abruptly(\[1\]). 
Given a degraded text image, word images can be ex- 
tracted after layout analysis. A word image from a de- 
graded text page may have touching characters, broken 
characters, distorted or blurred characters, which may 
make the word image difficult to recognize accurately. 
After character recognition and correction based on dic- 
tionary look-up, a word recognizer will provide one or 
more word candidates for each word image. Figure 1 
lists the word candidate sets for the sentence, "Please 
fill in the application form." Each word candidate has 
a confidence score, but the score may not be reliable 
because of noise in the image. The correct word candi- 
date is usually in the candidate set, but may not be the 
candidate with the highest confidence score. Instead of 
simply picking up the word candidate with the high- 
est recognition score, which may make the correct rate 
quite low, we need to find a method which can select a 
candidate for each word image so that the correct rate 
can be as high as possible. 
Contextual information and high-level knowledge can 
be used to select a decision word for each word image 
5 6 7 
application farm ! 
0.90 0.35 
applicators form 
0.05 0.30 
acquisition forth 
0.03 0.20 
duplication foam 
0.01 0.11 
implication force 
0.01 0.04 
Figure 1: Candidate Sets for the Sentence: 
in the application form/" 
"Please fill 
in its context. Currently, there are two approaches, 
the statistical approach and the structural approach, 
towards the problem of candidate selection. In the sta- 
tistical approach, language models, such as a Hidden 
Marker Model and word collocation can be utilized for 
candidate selection (\[2, 4, 5\]). In the structural ap- 
proach, lattice parsing techniques have been developed 
for candidate selection(\[3, 7\]). 
The contextual constraints considered in a statisti- 
cal language model, such as word collocation, are local 
constraints. For a word image, a candidate will be se- 
lected according to the candidate information from its 
neighboring word images in a fixed window size. The 
window size is usually set as one or two. In the lattice 
parsing method, a grammar is used to select a candi- 
date for each word image inside a sentence so that the 
sequence of those selected candidates form a grammat- 
ical and meaningful sentence. For example, consider 
the sentence "Please fill in the application form". We 
assume all words except the word "form" have been 
recognized correctly and the candidate set for the word 
"form" is { farm, form, forth, foam, forth } (see the 
second sentence in Figure 2). The candidate "form" 
can be selected easily because the collocation between 
"application" and "form" is strong and the resulting 
sentence is grammatical. 
The contextual information inside a small window or 
inside a sentence sometimes may not be enough to select 
a candidate correctly. For example, consider the sen- 
328 
Sentence 1 
1 
This 
2 
farm 
form 
forth 
foam 
force 
11 12 
Please fill 
3 4 5 6 7 8 9 10 
is almost the same as that one 
Sentence 2 
13 14 15 16 17 
in the application farm ! 
form 
forth 
foam 
force 
Figure 2: Word candidates of two example sen- 
tences(word images 2 and 16 are similar) 
® 
skill; it  iologica-t)ly based. LanKua ze 
is  ometh ing-G'K  born 
how to,'g_,,_zo_v¢. Yet hypofl\esis that 
%h re   t9Io:gicat unde. innings to 
human linguistic abili W does not ex- 
plain eve,-:ything. There may indeed 
versal elements. All ,k((dK'f  tan,ma -,es z 
s@certam orgamz a tional principles. 
Figure 3: Part of text page with three sentences 
tence "This form is almost the same as that one"(see 
the first sentence in Figure 2). Word image 16 has five 
candidates: { farm, form, forth, foam, forth }. After 
lattice parsing, the candidate "forth" will be removed 
because it does not fit the context. But it is difficult 
to select a candidate from "farm", "form" "foam" and 
"force" because each of them makes the sentence gram- 
matical and meaningful. In such a case, more contex- 
tual constraints are needed to distinguish the remaining 
candidates and to select the correct one. 
Let's further assume that the sentences in Figure 2 
are from the same text. By image matching, we know 
word images 2 and 16 are visually similar. If two word 
images are almost the same, they must be the same 
word. Therefore, same candidates must be selected for 
word image 2 and word image 16, After "form" is chosen 
for image 16 it can also be chosen as the decision for 
image 2. 
Possible Relations between W1. and W2 
type at symbolic level at image level 
W1--W2 
W2=XeWleY 
prefix_of(W1) = 
prefix_of(W2) 
suf yiz_oy(W1) = 
~u y yix_o y(W2 ) 
suyyiz_of(WQ = 
prefiz_of(W~) 
Note 1: "~" means approximately 
Note 2: "e" means concatenation. 
VV-~ ~ W2 
W1 ~ subimage_of(W2) 
left_part_of(W1) ,~ 
left_part_of(W2) 
right_part_of(W1) 
right_part_of(W2) 
right_part_of(W1) ,.~ 
left_part_of(W2) 
image matching; 
Table 1: Possible Inter-word Relations 
Visual Inter-Word Relations 
A visual inter-word relation can be defined between two 
word images if they share the same pattern at the image 
level. There are five types of visual inter-word relations 
listed in the right part of Table 1. Figure 3 is a part of 
a scanned text image in which a small number of word 
relations are circled to demonstrate the abundance of 
inter-word relations defined above even in such a small 
fragment of a real text page. Word images 2 and 8 are 
almost the same. Word image 9 matches the left part 
of word image 1 quite well. Word image 5 matches a 
part of the image 6, and so on. 
Visual inter-word relations can be computed by ap- 
plying simple image matching techniques. They can be 
calculated in clean text images, as well as in highly de- 
graded text fmages, because the word images, due to 
their relatively large size, are tolerant to noise (\[6\]). 
Visual inter-word relations can be used as constraints 
in the process of word image interpretation, especially 
for candidate selection. It is not surprising that word 
relations at the image level are highly consistent with 
word relations at the symbolic level(see Table 1). If two 
words hold a relation at the symbolic level and they are 
written in the same font and size, their word images 
should keep the same relation at the image level. And 
also, if two word images hold a relation at the image 
level, the truth values of the word images should have 
the same relation at the symbolic level. In Figure 3, 
word images 2 and 8 must be recognized as the same 
word because they can match each other; the identity 
of word image 5 must be a sub-string of the identity of 
word image 6 because word image 5 can match with a 
part of word image 6; and so on. 
Visual inter-word constraints provide us a way to link 
word images inside a text page, and to interpret them 
systematically. The research discussed in this paper in- 
tegrates visual inter-word constraints with a statistical 
language model and a lattice parser to improve the per- 
formance of candidate selection. 
329 
Current Status of Work 
A word-collocation-based relaxation algorithm and 
a probabilistic lattice chart parser have been de- 
signed for word candidate selection in degraded text 
recognition(\[3, 4\]). The relaxation algorithm runs iter- 
atively. In each iteration, the confidence score of each 
candidate is adjusted based on its current confidence 
and its collocation scores with the currently most pre- 
ferred candidates for its neighboring word images. Re- 
laxation ends when all candidates reach a stable state. 
For each word image, those candidates with a low con- 
fidence score will be removed from the candidate sets. 
Then, the probabilistic lattice chart parser will be ap- 
plied to the reduced candidate sets to select the can- 
didates that appear in the most preferred parse trees 
built by the parser. There can be different strategies to 
use visual inter-word constraints inside the relaxation 
algorithm and the lattice parser. One of the strategies 
we are exploiting is to re-evaluate the top candidates 
for the related word images after each iteration of re- 
laxation or after lattice parsing. If they hold the same 
relation at the symbolic level, the confidence scores of 
the candidates will be increased. Otherwise, the images 
with a low confidence score will follow the decision of 
the images with a high confidence score. 
Five articles from the Brown Corpus were chosen ran- 
domly as testing samples. They are A06, GO2, J42, NO1 
and ROT, each with about %000 words. Given a word 
image, our word recognizer generates its top10 candi- 
dates from a dictionary with 70,000 different entries. 
In preliminary experiments, we exploit only the type-1 
relation listed in Table 1. After clustering word im- 
ages by image matching, similar images will be in the 
same cluster. Any two images from the same cluster 
hold the type-1 relation. Word collocation data were 
trained from the Penn Treebank and the Brown Cor- 
pus except for the five testing samples. Table 2 shows 
results of candidate selection with and without using 
visual inter-word constraints. The top1 correct rate for 
candidate lists generated by a word recognizer is as low 
as 57.1%, Without using visual inter-word constraints, 
the correct rate of candidate selection by relaxation and 
lattice parsing is 83.1%. After using visual inter-word 
constraints, the correct rate becomes 88.2%. 
Article 
Number 
Of 
Words 
A06 2213 
G02 2267 
J42 2269 
N01 2313 
R07 2340 
Total 11402 
Word 
Recognition 
Result 
53.8% 
67.7% 
54.5% 
57.3% 
52.2% 
57.1% 
Candidate Selection 
Using No Using 
Constraints Constraints 
83.1% 88.5% 
83.8% 87.8% 
83.6% 89.5% 
82.7% 87.1% 
82.6% 88.1% 
83.1% 88.2% 
Table 2: Comparison Of Candidate Selection Results 
Conclusions and Future Directions 
Integration of natural language processing and image 
processing is a new area of interest in document anal- 
ysis. Word candidate selection is a problem we are 
faced with in degraded text recognition, as well as in 
handwriting recognition. Statistical language models 
and lattice parsers have been designed for the prob- 
lem. Visual inter-word constraints in a text page can 
be used with linguistic knowledge sources to facilitate 
candidate selection. Preliminary experimental results 
show that the performance of candidate selection is im- 
proved significantly although only one inter-word rela- 
tion was used. The next step is to fully integrate visual 
inter-word constraints and linguistic knowledge sources 
in the relaxation algorithm and the lattice parser. 
Acknowledgments 
I would like to thank Jonathan J. Hull for his support 
and his helpful comments on drafts of this paper. 

References 
\[1\] Henry S. Baird, "Document Image Defect Models 
and Their Uses," in Proceedings of the Second In- 
ternational Conference on Document Analysis and 
Recognition ICDAR-93, Tsukuba, Japan, October 
20-22, 1993, pp. 62-67. 
\[2\] Kenneth Ward Church and Patrick Hanks, "Word 
Association Norms, Mutual Information, and Lexi- 
cography," Computational Linguistics, Vol. 16, No. 
1, pp. 22-29, 1990. 
\[3\] Tao Hong and Jonathan J. Hull, "Text Recognition 
Enhancement with a Probabilistic Lattice Chart 
Parser," in Proceedings of the Second International 
Conference on Document Analysis and Recognition 
ICDAR-93, Tsukuba, Japan, October 20-22, 1993. 
\[4\] Tao Hong and Jonathan J. Hull, "Degraded Text 
Recognition Using Word Collocation," in Pro- 
ceedings of IS~T/SPIE Symposium on Document 
Recognition, San Jose, CA, February 6-10, 1994. 
\[5\] Jonathan J. Hull, "A Hidden Markov Model for 
Language Syntax in Text Recognition," in Pro- 
ceedings of llth IAPR International Conference on 
Pattern Recognition, The Hague, The Netherlands, 
pp.124-127, 1992. 
\[6\] Siamak Khoubyari and Jonathan J. Hull, "Keyword 
Location in Noisy Document Image," in Proceed- 
ings of the Second Annual Symposium on Docu- 
ment Analysis and Information Retrieval, Las Ve- 
gas, Nevada, pp. 217-231, April 26-28, 1993. 
\[7\] Masaru Tomita, "An Efficient Word Lattice Pars- 
ing Algorithm for Continuous Speech Recognition," 
in Proceedings of the International Conference on 
Acoustic, Speech and Signal Processing, 1986. 
