DESCRIPTION OF THE KENT RIDGE DIGITAL LABS SYSTEM
USED FOR MUC-7
Shihong Yu, Shuanhu Bai and Paul Wu
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace
Singapore 119613
Email: shyu@krdl.org.sg, bai@krdl.org.sg, paulwu@krdl.org.sg
BASICS OF THE SYSTEM
We aim to build a single, simple framework for text information extraction tasks in
which, to a certain extent, the required information can be resolved locally.
Our system is statistics-based. As usual, a language model is built from a training
corpus; this is the so-called learning process. Much effort has been spent on absorbing
domain knowledge into the language model in a systematic and generic way, because the
system is designed not for one particular task, but for general local information extraction.
For the information extraction part (tagging), the system consists of the following
modules:

- Sentence segmentor and tokenizer. This module accepts a stream of characters as
  input and transforms it into a sequence of sentences and tokens. The way of
  tokenization can vary with different tasks and domains. For example, most English
  text is tokenized in the same way, while tokenization in Chinese is itself a research
  topic.

- Text analyzer. This module provides the analysis necessary for the particular task,
  be it semantic, syntactic, orthographic, etc. The same analyzer is also applied in the
  learning process.

- Hypothesis generator. The possibilities for each word (token) are determined. Rules
  can be captured by letting one word have one choice, as is the case in the recognition
  of time, date, money and percentage terms for the Chinese Named Entity (NE) task.
  These are identified by pattern matching rules.

- Disambiguation module. This is essentially an implementation of the Viterbi algorithm.
All the above modules will be described in detail in the following sections.
FROM TEXT INFORMATION EXTRACTION TO TAGGING
First of all, a brief description of the modeling of the problem is in order. Each word
in the text is assigned a tag; information can then be obtained from the tags of all words.
For example, for the English NE task,

Example 1:

The/- British/- balloon/- ,/- called/- the/- Virgin/- Global/- Challenger/- ,/- is/- to/-
be/- flown/- by/- Richard/PERSON Branson/PERSON ,/- chairman/- of/- Virgin/ORG
Atlantic/ORG Airways/ORG ;/-

Grouping all adjacent words with tag PERSON gives a person name, grouping those
with tag ORG gives an organization name, etc.
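This grouping step can be sketched as a small routine (a minimal illustration under our own conventions, not the system's actual code):

```python
def extract_entities(tagged):
    """Group maximal runs of adjacent words that share a non-'-' tag.
    `tagged` is a list of (word, tag) pairs, as in Example 1."""
    entities = []
    words, current = [], None
    for word, tag in tagged + [("", "-")]:  # sentinel flushes the last run
        if tag == current and tag != "-":
            words.append(word)
        else:
            if current not in (None, "-"):
                entities.append((" ".join(words), current))
            words, current = [word], tag
    return entities
```

On the tail of Example 1, this yields ("Richard Branson", "PERSON") and ("Virgin Atlantic Airways", "ORG").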
The problem becomes: for any given sequence of words w = w1 w2 ... wn, find the tags
t = t1 t2 ... tn correspondingly.
Note that there are different ways of assigning tags. For the above example, the tags
can also be:

Example 1 (alternative tag set):

The/- British/- balloon/- ,/- called/- the/- Virgin/- Global/- Challenger/- ,/- is/- to/- be/-
flown/- by/- Richard/PERSON-start Branson/PERSON-end ,/- chairman/- of/- Virgin/
ORG-start Atlantic/ORG-continue Airways/ORG-end ;/-

This way, extra information such as common surnames, first names, organization endings
(Corp., Inc., etc.) and so on can be obtained. It is observed that different tag sets for the
same task make a difference. We feel that choosing an appropriate tag set is a problem
worthy of careful investigation. Intuitively, a tag set for a particular task must be: sufficient,
meaning that the information extracted must be sufficient for the task; and efficient, meaning
that there should be no redundant or nonrelevant information.
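Producing the richer tag set from span annotations is mechanical. The sketch below assumes entities are given as inclusive word-index ranges; the label used for the one-word case is our own assumption, not specified in the paper:

```python
def span_tags(words, spans):
    """Emit start/continue/end style tags for one sentence.
    `spans` maps inclusive (i, j) word-index ranges to entity types;
    all other words receive the '-' tag."""
    tags = ["-"] * len(words)
    for (i, j), etype in spans.items():
        if i == j:
            tags[i] = etype + "-single"  # hypothetical label for one-word entities
        else:
            tags[i] = etype + "-start"
            for k in range(i + 1, j):
                tags[k] = etype + "-continue"
            tags[j] = etype + "-end"
    return tags
```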
LEARNING PROCESS: INFORMATION DISTILLATION OF TRAINING
CORPUS
Learning Process in General
Careful consideration has been given to how to absorb domain knowledge into
language model(s) in a generic and systematic way. The basic idea is that as much as
possible of the relevant and significant information (for the task) contained in the original
corpus should be retained in back-off corpora, where back-off features are stored, so that
correct decisions can be made from the statistics generated from the back-off corpora when
they cannot be made from the statistics of the original training corpus.

The original training corpus is in the form of word/tag, from which statistics about
words and tags, including local contextual information, can be obtained. Each word in the
corpus is given a back-off feature, by the principle that the back-off features of all words
should extract the most information from the corpus relevant to the particular task. The
information loss is compensated by a gain in generality. A back-off corpus in the form of
back-off feature/tag is then generated, and statistics can be obtained in the same manner.
The original corpus is processed this way a certain number of times. Each time, a less
descriptive back-off corpus which gains more in generality is generated, and with it the
corresponding statistics.
For example, semantic classes can be used as back-off features for all the words in
Example 1, which gives a back-off corpus of the following form:

seman1/- seman2/- ... semanM-1/PERSON semanM/PERSON ... semanN-3/ORG
semanN-2/ORG semanN-1/ORG semanN/-
or part-of-speech tags as back-off features, which gives

pos1/- pos2/- ... posM-1/PERSON posM/PERSON ... posN-3/ORG posN-2/ORG
posN-1/ORG posN/-

[Figure 1: Information Distillation of Training Corpus. The original training corpus
generates the 1st, 2nd, ..., Nth back-off corpora, with information I > I1 > I2 > ... > IN
and generality G < G1 < G2 < ... < GN.]

The generation of back-off corpora is described by Figure 1. The total number of back-off
corpora therein is a controllable parameter.
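The distillation chain of Figure 1 amounts to repeated substitution of words by increasingly general features. A toy sketch, where the feature functions stand in for, e.g., a semantic-class lookup and a part-of-speech lookup:

```python
def backoff_corpora(corpus, feature_fns):
    """Build the chain of back-off corpora from an original word/tag corpus.
    `corpus` is a list of (word, tag) pairs; `feature_fns` is ordered from
    most descriptive to most general, one function per back-off level."""
    return [[(fn(word), tag) for word, tag in corpus] for fn in feature_fns]
```

The number of functions passed in is exactly the controllable parameter mentioned above.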
Learning Process for Chinese NE
- Training Corpus and Supporting Resources

We have a text corpus of about 500,000 words from People's Daily and the Xinhua News
Agency, all of which were manually checked for both word segmentation and part of
speech tagging.

In addition, we have a lexicon of 89,777 words, in which 5,351 words are labeled as
geographic names, 304 words are people's names and 183 are organization names. 1,167
words consist of more than 4 characters. The longest word (meaning "Great Britain
and Northern Ireland United Kingdom") contains 13 characters.

About 50,000 different words appear in the 500,000-word corpus.

We also have three entity name lists: a people name list (67,616 entries), a location name
list (6,451 entries) and an organization name list (6,190 entries).
- Observation: Problems and Solutions

1. Intuitively, the case information of proper names in the English writing system
   provides a good indication of the locations and boundaries of entity names. There
   are successful systems [2] built upon this intuition. Unfortunately, the uniformity
   of character strings in the Chinese writing system does not carry such information.
   One should look for analogous indicative characteristics which may be unique to
   the Chinese language.

2. "Word" in Chinese is a vague concept, and there is no clear definition for it. There
   are boundary ambiguities between words in texts even for human understanding,
   and inevitably for machine processing. Tokenization, or word segmentation, is still
   a problem in Chinese NLP. Word boundary ambiguities exist not only between
   commonly used words which are not in entity names, but also between commonly
   used words and entity names.

3. Besides the uniform appearance of characters, proper names in Chinese can consist
   of commonly used words. As a matter of fact, almost all Chinese characters can be
   commonly used words themselves, including those in entity names such as people's
   names, location names, etc. Therefore, unlike English, the problem of Chinese
   entity recognition should not be isolated from the problem of tokenization, or word
   segmentation.
- Building Language Models

One level of back-off features, also called word classes, is obtained in the following way.

We extend the idea in the new-word detection engine of the integrated model of the
Chinese word segmentor and part of speech tagger [1]. The idea is to extend the scope
of one word class of interest among new words, the proper names, to named entities
by looking into a broader range of constituents. Under this framework, we believe
contextual statistics play an important role in deciding word boundaries and predicting
the categories of named entities, while local statistics, or the information residing within
words or entities, can provide evidence suggesting the appearance of a named entity
and deciding the validity of these entities. We need to make full use of both contextual
and local statistics to recognize these named entities; thus a contextual language model
and entity models are created.
The basic process to build the model is as follows:

1. Change the tag set of the part-of-speech tagger by splitting the tag NOUN into
   more detailed tags related to the particular task, including the symbolic notions of
   person, location, organization, date, time, money and percentage.

2. Replace the tag NOUN in the training corpus with the above extended new tags.
   Only ambiguous words are manually checked.

3. Build the contextual language model from the training corpus with the new tag set.

4. Build entity models from the entity name lists. Each entity category has its own
   model.
Learning Process for English NE
- Training Corpus and Supporting Resources

The SGML-marked-up (for the NE task only) Brown corpus and a corpus from the
Wall Street Journal. In total, the text is 7.2MB of words, or 9.5MB with SGML markup.
Supporting resources include the location list, country list, corporation reference list
and people's surname list provided by MUC. Only the single-word entries in these
lists are in actual use.
- Observation: Problems and Solutions

Case information, or more generally orthographic information, gives good evidence of
names, as was observed in [2], although things get muddled when one really digs into
it: e.g. first words of sentences, words which do not have a normal (lower) case form
at all (e.g. "I"), or words whose cases are changed for other reasons such as formatting
(e.g. titles), being artifacts, etc. Nevertheless, this is very important information for
identifying entity names.

Prepositions are also helpful, as are common suffixes and prefixes of the entities, such
as Corp., Mr., and so on. In general, all such useful information should somehow be
sorted out. Word classes tailored for this particular purpose would be ideal.
- Building Language Models

There are two levels of back-off features, represented by word classes.

For the following words, the two back-off features are the same:

  - Hand-crafted special words for the NE task. Each possesses a different word class
    (represented by the word itself). These special words include "I", "the", "past",
    "pound", "following", "of", "in", "May", etc. In total there are about 100 such
    words;

  - Words from the supporting resources (as stated at the beginning of this section).
    Words from the same list possess the same word class;

  - Hand-crafted lists of words, which include week words (Monday, Tuesday, ...),
    month words (January, February, ...), cardinal numbers (one, two, 1-31, ...),
    ordinal numbers (1st, first, 2nd, second, ...), etc.

For the rest of the words, the first-level features are word classes provided by machine
auto-classification of words, while the second-level features include:
word class               example
oneDigitNum              1
containsDigitAndColon    2:34
containsAlphaDigit       A4
allCaps                  KRDL
capPeriod                M.
firstCommonWordInitCap
firstNonCommonWordIC
CommonWordInitCap        Department
initCapNotCommonWord     David
mixedCasesWord           ValueJet
charApos                 O'clock
allLowerCase             can
compoundWord             ad-hoc

In total, the number of orthographic features is about 30.
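A classifier for these orthographic classes might look as follows. The order of the checks and the common-word predicate are our assumptions, since the paper does not spell out the exact logic:

```python
import re

def ortho_class(word, is_common, sentence_initial=False):
    """Map a word to one of the orthographic back-off classes listed above.
    `is_common` is a predicate over a common-word lexicon (an assumption);
    the check order below is one plausible choice, not the authors' own."""
    if re.fullmatch(r"\d", word):
        return "oneDigitNum"
    if re.fullmatch(r"\d+:\d+", word):
        return "containsDigitAndColon"
    if re.fullmatch(r"[A-Za-z]+\d+", word):
        return "containsAlphaDigit"
    if re.fullmatch(r"[A-Z]\.", word):
        return "capPeriod"
    if len(word) > 1 and word.isupper():
        return "allCaps"
    if "'" in word:
        return "charApos"
    if "-" in word:
        return "compoundWord"
    if word[:1].isupper():
        if any(c.isupper() for c in word[1:]):
            return "mixedCasesWord"
        if sentence_initial:
            return ("firstCommonWordInitCap" if is_common(word.lower())
                    else "firstNonCommonWordIC")
        return ("CommonWordInitCap" if is_common(word.lower())
                else "initCapNotCommonWord")
    if word.islower():
        return "allLowerCase"
    return "other"  # fallback; the full system has about 30 classes
```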
To give a sense of what information is extracted from the original training corpus, the
two back-off sentences for Example 1 are:

Level 1:

the/- COUN_ADJ/- WordClass1/- ,/- WordClass2/- the/- WordClass3/- WordClass4/-
WordClass5/- ,/- WordClass6/- to/- WordClass7/- WordClass8/- by/- WordClass9/
PERSON WordClass10/PERSON ,/- WordClass11/- of/- WordClass12/ORG Loc/
ORG WordClass13/ORG ;/-

Level 2:

the/- COUN_ADJ/- LowerCaseWord/- ,/- LowerCaseWord/- the/- CommonWord-
InitCap/- CommonWordInitCap/- CommonWordInitCap/- ,/- LowerCaseWord/-
to/- LowerCaseWord/- LowerCaseWord/- by/- initCapNotCommonWord/PERSON
initCapNotCommonWord/PERSON ,/- LowerCaseWord/- of/- CommonWordInit-
Cap/ORG Loc/ORG CommonWordInitCap/ORG ;/-
Statistics such as the possibilities for CommonWordInitCap (which are NOT first words
of sentences) and the corresponding frequencies can be obtained from the second back-
off corpus. From our corpus, these are:

Organization                7525
None of the named entities  8493
Location                     896
Person                       195
Date                           8
Money                          2
From the above statistics, it is interesting to notice that non-first common words which
are initial-capitalized have a far greater chance of being an organization than a person
(frequencies 7525 vs. 195) or a location (frequencies 7525 vs. 896). This agrees with
general observations. Also interesting is that such words have a higher chance of not
being any of the seven entities. This comes as a bit of a surprise; for NLP researchers,
though, it may not be a surprise at all. This example also gives a sense of how general
observations are represented in a precise way.
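These counts translate directly into the relative-frequency estimates the tagger uses:

```python
# Frequencies of tags for non-sentence-initial CommonWordInitCap words,
# taken from the table above.
counts = {
    "ORGANIZATION": 7525,
    "NONE": 8493,
    "LOCATION": 896,
    "PERSON": 195,
    "DATE": 8,
    "MONEY": 2,
}
total = sum(counts.values())  # occurrences in the second back-off corpus
# Relative-frequency estimate of prob(tag | CommonWordInitCap, not sentence-initial)
probs = {tag: n / total for tag, n in counts.items()}
```

For example, prob(ORGANIZATION) comes out near 0.44 while prob(PERSON) is near 0.01, which is the preference stated informally above, made quantitative.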
Further research is to be carried out to justify quantitatively the merits of this learning
process. Its full potential has yet to be exploited. So far, our experimentation has shown
that:

1. Various kinds of text analysis (syntactic, semantic, orthographic, etc.) can be
   incorporated into the same framework in a precise way, and used in the information
   extraction (tagging) stage in the same way;

2. It provides an easy way to absorb human knowledge as well as domain knowledge,
   so customization can be done easily;

3. It gives great flexibility as to how to optimize the system.

Points 1 and 2 are fairly clear from the above discussion. Details on the disambiguation
module will reveal 3.
DETAILS OF THE SYSTEM MODULES
1. Sentence segmentor and tokenizer: initial tokenization by dictionary lookup for
   Chinese; the standard way for English.

2. Text analyzer. What was done to the training corpus in the learning stage is done
   here. After the analysis, each word possesses a given number of back-off features.

3. Hypothesis generator.

   - Chinese: based on entities' prefixes, suffixes, trigger words and local context
     information, guesses are made about the possible boundaries and categories of
     entities. Time, date, money and percentage are extracted by pattern-matching
     rules.

   - English: for each word, basically look for all the possibilities in the database
     first. If the word is not found, look for the possibilities of its back-off features.
4. Disambiguation module. Recall that information extraction from a word sequence
   w becomes finding the corresponding tag sequence t. In the paradigm of maximum
   likelihood estimation, the best set of tags t is the one such that prob(t|w) =
   max_t' prob(t'|w). This is equivalent to finding t such that prob(tw) = max_t' prob(t'w),
   because prob(t'|w) = prob(t'w)/prob(w) and prob(w) is a constant for any given w.
   The following equality is well known:

   prob(tw) = prob(t1) prob(w1|t1) prob(t2|t1 w1) prob(w2|t1 w1 t2)
              ... prob(tn|t1 w1 ... t(n-1) w(n-1)) prob(wn|t1 w1 ... t(n-1) w(n-1) tn).   (1)
   Computationally, it is only feasible when some (actually most) dependencies are
   dropped, for example,

   prob(tk|t1 w1 ... t(k-1) w(k-1)) ≈ prob(tk|t(k-1) t(k-2)),   (2)

   prob(wk|t1 w1 ... t(k-1) w(k-1) tk) ≈ prob(wk|tk t(k-1)).   (3)
   (2) and (3) can be justified by Hidden Markov Modeling of the generation of word
   sequences.

   As always, the Viterbi algorithm is employed to compute the probability (1), given
   approximations like (2) and (3). When the sparse data problem is encountered, a
   back-off and smoothing strategy can be adopted, e.g.

   prob(wk|tk t(k-1))  backs off to  prob(wk|tk),   (4)
   or, for unknown words, substitute the word in (4) with its back-off features, e.g.

   prob(wk|tk t(k-1))  backs off to  prob(bof1_k|tk t(k-1))
                       backs off to  prob(bof2_k|tk t(k-1))  ...
                       backs off to  prob(bofN_k|tk t(k-1))
                       backs off to  prob(bof1_k|tk)  ...  backs off to  prob(bofN_k|tk),

   where N is the total number of back-off features for the word.
   Note that no smoothing is employed in the above scheme. From this scheme one can
   see that there exist various ways of backing off and smoothing. This characteristic,
   as well as the free choice of back-off features, is where the flexibility of the system
   lies.

   Remark: In the actual system, the back-off and smoothing schemes differ from the
   above. The actual schemes are not included because they are more complicated,
   and no systematic experimentation has yet been done to show that they are better
   than the other options.
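The disambiguation step, with (2) and (3) simplified further to plain bigrams and the back-off chain collapsed into a single floor probability, can be sketched as follows (an illustrative toy, not the actual system):

```python
import math

def viterbi(words, tags, trans, emit, floor=1e-8):
    """Find the tag sequence maximizing prob(tw) under bigram approximations:
    trans[(t_prev, t)] ~ prob(tk | t(k-1)), with '<s>' as the start symbol,
    emit[(w, t)] ~ prob(wk | tk).  Missing entries fall back to `floor`,
    standing in for the back-off chain described above."""
    lp = lambda d, k: math.log(d.get(k, floor))
    # delta[t] = best log-probability of any tag sequence ending in t
    delta = {t: lp(trans, ("<s>", t)) + lp(emit, (words[0], t)) for t in tags}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in tags:
            best = max(prev, key=lambda tp: prev[tp] + lp(trans, (tp, t)))
            delta[t] = prev[best] + lp(trans, (best, t)) + lp(emit, (w, t))
            ptr[t] = best
        back.append(ptr)
    t = max(delta, key=delta.get)   # best final tag
    path = [t]
    for ptr in reversed(back):      # follow back-pointers
        t = ptr[t]
        path.append(t)
    return path[::-1]
```

With hand-set probabilities favoring PERSON after "by", the fragment "by Richard Branson" decodes to the tags of Example 1.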
PERFORMANCE ANALYSIS
The system currently processes one sentence at a time, and no memory is kept once the
sentence is done. Furthermore, due to the limitation of time, the guidelines for both Chinese
and English NE were not entirely followed, as we did not have time to read them carefully!

The F-measures of the formal run for Chinese and English are 86.38% and 77.74%,
respectively. Given the limited time (less than six months) and resources (three persons,
all half time), we are satisfied with the performance.
* * * CHINESE NE SUMMARY SCORES * * *

              P&R    2P&R   P&2R
F-MEASURES   86.38  84.39  88.46

* * * ENGLISH SUMMARY SCORES * * *

              P&R    2P&R   P&2R
F-MEASURES   77.74  79.06  76.46
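The three columns are weighted F-measures. With the usual formula F_beta = (beta^2 + 1)PR / (beta^2 P + R), where beta is the relative weight given to recall, P&R is the balanced beta = 1 case; reading 2P&R and P&2R as doubling the weight of precision and recall respectively is our interpretation of the MUC labels, not stated in the scores themselves:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r;
    beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)
```

When precision and recall are equal, all three columns coincide, which is why the 2P&R and P&2R scores straddle the balanced one.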
FUTURE RESEARCH DIRECTION
Our brief experimentation with Chinese and English Named Entity recognition shows that
the system has great potential that deserves further investigation.

1. Modeling of the problem: currently, information and knowledge are represented in
   the form of word/tag. This may be too restrictive. A better way of representing
   information and knowledge, in other words a better modeling of the problem, should
   be studied.

2. Quantitative justification of the learning process (knowledge distillation) should also
   be studied. The system should be able to compare different sets of back-off features
   so that the best one can be chosen.

3. The system provides great flexibility as to how to optimize it. The optimization should
   be done systematically, rather than trial by trial as is the case for the time being.

References

[1] S. Bai, An Integrated Model of Chinese Word Segmentation and Part of Speech Tagging,
    Advances and Applications on Computational Linguistics (1995), Tsinghua University
    Press.

[2] D.M. Bikel, S. Miller, R. Schwartz and R. Weischedel, Nymble: a High-Performance
    Learning Name-finder.
