Applying an NVEF Word-Pair Identifier to  
the Chinese Syllable-to-Word Conversion Problem 
 
Jia-Lin Tsai 
Intelligent Agent Systems Lab. 
Institute of Information Science, Academia Sinica, 
Nankang, Taipei, Taiwan, R.O.C. 
tsaijl@iis.sinica.edu.tw 
Wen-Lian Hsu 
Intelligent Agent Systems Lab. 
Institute of Information Science, Academia Sinica, 
Nankang, Taipei, Taiwan, R.O.C. 
hsu@iis.sinica.edu.tw 
 
Abstract  
Syllable-to-word (STW) conversion is important 
in Chinese phonetic input methods and speech 
recognition. There are two major problems in STW conversion: (1) resolving the ambiguity caused by homonyms, and (2) determining the word segmentation. This paper describes a
noun-verb event-frame (NVEF) word identifier that can be used to solve these problems effectively. Our approach includes (a) an NVEF word-pair identifier and (b) other word identifiers for the non-NVEF portion.
Our experiments showed that the NVEF word-pair identifier achieves a 99.66% STW accuracy for the NVEF-related portion; combined with the other identifiers for the non-NVEF portion, the overall STW accuracy is 96.50%.
The results of this study indicate that NVEF knowledge is very powerful for STW conversion. In fact, numerous cases requiring disambiguation in natural language processing fall into such a "chicken-and-egg" situation. NVEF knowledge can be employed as a general tool in such systems to disambiguate the NVEF-related portion independently (thus breaking the chicken-and-egg situation) and then to use that result as a solid basis for treating the remaining portion. This suggests that NVEF knowledge is likely to be important for general NLP. To further expand its coverage, we shall extend the study of NVEF to other co-occurrence restrictions, such as noun-noun pairs, noun-adjective pairs and verb-adverb pairs. We believe the STW accuracy can be further improved with this additional knowledge.
1. Introduction 
More than 100 Chinese input methods have been created in the past [1-6]. Currently, the most popular input method is based on phonetic symbols. The phonetic input method requires little training because Chinese students are taught in primary school to write the pinyin syllable corresponding to each Chinese character. Since there are more than 13,000 distinct Chinese characters (around 5,400 of which are commonly used) but only 1,300 distinct syllables, the homonym problem is quite severe in the phonetic input method. Therefore, intelligent syllable-to-word (STW) conversion for Chinese is very important. A comparable (but easier) problem in English is word-sense disambiguation.
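To make the severity of the homonym problem concrete, the following Python sketch counts how many candidate character sequences a short syllable sequence can map to. The candidate table here is a tiny hypothetical sample for illustration only, not a real lexicon; with roughly ten homophonous characters per syllable on average, the candidate space grows exponentially in the length of the input.

```python
# Minimal sketch of the homonym problem in syllable-to-word conversion.
# The candidate table below is a small illustrative sample, not a real lexicon.
syllable_to_chars = {
    "shi4": ["是", "事", "市", "式", "視"],
    "li4":  ["力", "立", "利", "例", "歷"],
}

def candidate_count(syllables):
    """Return the number of character sequences a syllable sequence could map to."""
    count = 1
    for s in syllables:
        count *= len(syllable_to_chars.get(s, []))
    return count

print(candidate_count(["shi4", "li4"]))  # 5 x 5 = 25 candidate sequences
```

Even this two-syllable input yields 25 candidate sequences; an STW converter must pick the intended one.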
There are basically two approaches to STW conversion: (a) the linguistic approach, based on syntax parsing or semantic template matching [3,4,7,8], and (b) the statistical approach, based on the n-gram model where n is usually 2 or 3 [9-12]. The linguistic approach is more laborious, but the end result can be more user-friendly. On the other hand, the statistical approach is less labor-intensive, but its power depends on the training corpus, and it usually does not provide deep semantic information. Our approach adopts the semantically oriented NVEF word-pairs (defined formally in Section 2.1) plus other statistical methods, so that not only does the result make sense semantically, but the model is also fully automatic, provided that enough NVEF word-pairs have already been collected.
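The statistical approach mentioned above can be sketched as follows: each syllable is expanded into its candidate characters, and the sequence maximizing the n-gram (here bigram, n=2) score is selected. The candidate table and all probabilities below are illustrative assumptions, not trained values from any corpus.

```python
import math

# Sketch of a bigram (n = 2) statistical approach to STW conversion.
# The candidate table and probabilities are toy values for illustration.
CANDIDATES = {
    "yi1": ["一", "衣"],
    "sheng1": ["生", "聲"],
}
BIGRAM = {("一", "生"): 0.6, ("一", "聲"): 0.1,
          ("衣", "生"): 0.05, ("衣", "聲"): 0.25}
UNIGRAM = {"一": 0.7, "衣": 0.3}

def best_conversion(syllables):
    """Return the character sequence maximizing the bigram-model log score."""
    best, best_score = None, -math.inf

    def walk(i, prev, seq, score):
        nonlocal best, best_score
        if i == len(syllables):
            if score > best_score:
                best, best_score = "".join(seq), score
            return
        for c in CANDIDATES[syllables[i]]:
            # First character scored by unigram, the rest by bigram transitions.
            p = UNIGRAM[c] if prev is None else BIGRAM[(prev, c)]
            walk(i + 1, c, seq + [c], score + math.log(p))

    walk(0, None, [], 0.0)
    return best

print(best_conversion(["yi1", "sheng1"]))  # "一生" under these toy scores
```

Real systems replace the exhaustive search with dynamic programming (Viterbi) and estimate the probabilities from a large corpus; as noted above, such a model captures co-occurrence statistics but no deep semantic information.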
According to the studies in [13], good syllable sequence segmentation is crucial for STW conversion. For example, consider the syllable
sequence “zhe4 liang4 che1 xing2 shi3 shun4 
