Unknown Word Extraction for Chinese Documents 
 
Keh-Jiann Chen 
Institute of Information Science,
Academia Sinica 
kchen@iis.sinica.edu.tw 
Wei-Yun Ma 
Institute of Information Science,
Academia Sinica 
ma@iis.sinica.edu.tw 
 
Abstract  
Chinese text has no blanks to mark word
boundaries. As a result, identifying words is
difficult because of segmentation ambiguities
and occurrences of unknown words.
Conventionally, unknown words have been
extracted by statistical methods, because such
methods are simple and efficient. However,
statistical methods that use no linguistic
knowledge suffer from low precision and low
recall: character strings with statistical
significance might be phrases or partial phrases
instead of words, and low-frequency new words
are hardly identifiable by statistical methods.
In addition to statistical information, we try to
use as much information as possible, such as
morphology, syntax, semantics, and world
knowledge. The identification system fully
utilizes the context and content information of
unknown words in its detection, extraction, and
verification processes. A practical unknown
word extraction system was implemented that
identifies new words online, including
low-frequency new words, with high precision
and high recall.
1 Introduction 
One of the most prominent problems in
computer processing of the Chinese language is
identifying the word sequences of input
sentences. Chinese text has no blanks to mark
word boundaries. As a result, identifying words
is difficult because of segmentation ambiguities
and occurrences of unknown words (i.e.
out-of-vocabulary words). Most papers on word
segmentation focus only on resolving ambiguous
segmentations. The problem of unknown word
identification is considered more difficult and
needs further investigation. An inspection of the
Sinica corpus (Chen et al., 1996), a
word-segmented Chinese corpus of 5 million
words, shows that 3.51% of its words are not
listed in the CKIP lexicon, a Chinese lexicon
with more than 80,000 entries.
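
A figure of this kind can be obtained by counting the words of a segmented corpus that are missing from a lexicon. The following is a minimal sketch, assuming the rate is measured over word tokens; the function name `oov_rate` and the toy corpus and lexicon are hypothetical and are not taken from the Sinica corpus or the CKIP lexicon.

```python
def oov_rate(segmented_sentences, lexicon):
    """Return the fraction of word tokens that are not listed in the lexicon."""
    total = 0
    unknown = 0
    for sentence in segmented_sentences:
        for word in sentence:
            total += 1
            if word not in lexicon:
                unknown += 1
    return unknown / total if total else 0.0


# Hypothetical toy data: each sentence is already segmented into words.
toy_lexicon = {"我", "喜歡", "電腦", "的", "研究"}
toy_corpus = [
    ["我", "喜歡", "電腦"],
    ["電腦", "的", "研究"],
    ["我", "喜歡", "陳小明"],  # the personal name is not in the toy lexicon
]

rate = oov_rate(toy_corpus, toy_lexicon)
print(f"unknown word rate: {rate:.2%}")
```

Counting word types instead of tokens would give a different rate, so the unit of counting should be stated alongside any such figure.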
Identifying Chinese unknown words in a
document is difficult because:
 
1. There are no blanks to mark word boundaries;
2. Almost all Chinese characters and words are also
morphemes;
3. Morphemes are syntactically and semantically
ambiguous;
4. Words with the same morpho-syntactic structure might
have different syntactic categories;
5. No simple rules can enumerate all types of unknown
words;
6. Online identification from a short text is even harder,
since low-frequency unknown words are not
identifiable by naive statistical methods.
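
The last point can be illustrated with a naive frequency-based extractor. The sketch below is not the method of this paper; it is a hypothetical baseline that proposes every character n-gram occurring at least `min_count` times as a word candidate. The function name, the thresholds, and the toy text are assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(text, max_len=3, min_count=2):
    """Return character n-grams (2 <= n <= max_len) seen at least min_count times."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram for gram, count in counts.items() if count >= min_count}


# Hypothetical short text: "我的" ("my") recurs three times, while the
# personal name "陳小明" occurs only once.
toy_text = "我的電腦我的書我的筆陳小明來了"
candidates = frequent_ngrams(toy_text)
print(candidates)
```

On this toy text the only candidate is the phrase 我的, which is not a word, while the name 陳小明 is missed because it occurs once. Both the false candidate and the missed name follow directly from relying on frequency alone in a short text.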
 
It is difficult to identify unknown words in a
text, since all Chinese characters can be either
morphemes or words and there are no blanks to
mark word boundaries. Therefore, without (or
even with) syntactic or semantic checking, it is
difficult to tell whether a character in a
particular context is part of an unknown word
or stands alone as a word. Compound words and
proper names are the two major types of
unknown words. It is not possible to list all
proper names and compounds in a lexicon, nor
to enumerate them by morphological rules.
Conventionally, unknown words have been
extracted by statistical methods, because such
methods are simple and efficient. However,
statistical methods that use no linguistic
knowledge suffer from low precision and low
recall, because character strings with statistical
significance might be phrases or partial phrases
instead of words and low-frequency new words

References 

1   Chang, J. S., S. D. Chen, S. J. Ker, Y. Chen, & J. Liu, 1994, "A Multiple-Corpus Approach to Recognition of Proper Names in Chinese Texts," Computer Processing of Chinese and Oriental Languages, Vol. 8, No. 1, pp. 75--85.

2   Chang, Jing-Shin and Keh-Yih Su, 1997a. "An Unsupervised Iterative Method for Chinese New Lexicon Extraction", International Journal of Computational Linguistics & Chinese Language Processing, 1997. 

3   Chen, H. H., & J. C. Lee, 1994, "The Identification of Organization Names in Chinese Texts", Communication of COLIPS, Vol. 4 No. 2, 131--142. 

4   Chen, K. J. & S. H. Liu, 1992, "Word Identification for Mandarin Chinese Sentences," Proceedings of the 14th Conference on Computational Linguistics, August 23-28, Nantes, France.

5   Chen, K. J., C. R. Huang, L. P. Chang & H. L. Hsu, 1996, "SINICA CORPUS: Design Methodology for Balanced Corpora," Proceedings of PACLIC 11th Conference, pp. 167--176. 

6   Chen, C. J., M. H. Bai, K. J. Chen, 1997, "Category Guessing for Chinese Unknown Words." Proceedings of the Natural Language Processing Pacific Rim Symposium 1997, pp. 35--40. NLPRS '97 Thailand. 

7   Chen, K. J. & Ming-Hong Bai, 1998, "Unknown Word Detection for Chinese by a Corpus-based Learning Method," International Journal of Computational Linguistics and Chinese Language Processing, Vol. 3, No. 1, pp. 27--44.

8   Chen, K. J. & Chao-Jan Chen, 1998, "A Corpus Based Study on Computational Morphology for Mandarin Chinese," Quantitative and Computational Studies on the Chinese Language, Benjamin K. T'sou, Tom B. Y. Lai, Samuel W. K. Chan, William S-Y. Wang, ed. HK: City Univ. of Hong Kong, pp. 283--306.

9   Chiang, T. H., M. Y. Lin, & K. Y. Su, 1992, "Statistical Models for Word Segmentation and Unknown Word Resolution," Proceedings of ROCLING V, pp. 121--146.

10   Chien, Lee-feng, 1999, "PAT-tree-based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval," Information Processing and Management, Vol. 35, pp. 501--521.

11   Church, K. W. & R. L. Mercer, 1993, "Introduction to the Special Issue on Computational Linguistics Using Large Corpora," Computational Linguistics, Vol. 19, No. 1, March.

12   Church, K. W., 2000, "Empirical Estimates of Adaptation: The Chance of Two Noriegas is Closer to p/2 than p^2," Proceedings of the 18th Conference on Computational Linguistics, July 31-August 4, Saarbrücken, Germany, pp. 180--186.

13   Huang, C. R. et al., 1995, "The Introduction of Sinica Corpus," Proceedings of ROCLING VIII, pp. 81--89.

14   Lin, M. Y., T. H. Chiang, & K. Y. Su, 1993, "A Preliminary Study on Unknown Word Problem in Chinese Word Segmentation," Proceedings of ROCLING VI, pp. 119--137.

15   Smadja, F., 1993, "Retrieving Collocations from Text: Xtract," Computational Linguistics, Vol. 19, No. 1, March.

16   Smadja, F., K. R. McKeown, & V. Hatzivassiloglou, 1996, "Translating Collocations for Bilingual Lexicons: A Statistical Approach," Computational Linguistics, Vol. 22, No. 1, March, pp. 1--38.

17   Sproat, R., W. Gale, C. Shih, & N. Chang, 1996, "A Stochastic Finite-State Word-Segmentation Algorithm for Chinese," Computational Linguistics, Vol. 22, No. 3, September, pp. 377--404.

18   Sun, M. S., C. N. Huang, H. Y. Gao, & Jie Fang, 1994, "Identifying Chinese Names in Unrestricted Texts", Communication of COLIPS, Vol. 4 No. 2, 113--122. 
