Jurilinguistic Engineering in Cantonese Chinese: 
An N-gram-based Speech to Text Transcription System 
B K T'sou, K K Sin, S W K Chan, T B Y Lai, C Lun, K T Ko, G K K Chan, L Y L Cheung 
Language hfformation Sciences Research Centre 
City University of Itong Kong 
Tat Chee Avenue, Kowloon 
Hong Kong SAR, China 
Email: rlbtsou @nxmail.cityu.edu.hk 
Abstract 
A Cantonese Chinese transcription system to 
automatically convert stenograph code to 
Chinese characters ix reported. The major 
challenge in developing such a system is the 
critical homocode problem because of 
homonymy. The statistical N-gram model is 
used to compute the best combination of 
characters. Supplemented with a 0.85 million 
character corpus of donmin-specific training 
data and enhancement measures, the bigram 
and trigrmn implementations achieve 95% 
and 96% accuracy respectively, as compared 
with 78% accuracy in the baseline model. The 
system perforlnance is comparable with other 
adwmced Chinese Speech-to-Text input 
applications under development. The system 
meets an urgent need o1' the .ludiciary ot: post- 
1997 Hong Kong. 
Keyword: Speech to Text, Statistical 
Modelling, Cantonese, Chinese, Language 
Engineering 
1. Introduction 
British rule in Hong Kong lnade English the only 
official  in the legal domain for over a 
Century. After the reversion of Hong Kong 
sovereignty to China in 1997, legal bilingualism 
has brought on an urgent need to create a 
Computer-Aided Transcription (CAT) system for 
Cantonese Chinese to produce and maintain the 
massive legally tenable records of court 
proceedings conducted in the local majority 
 (T'sou, 1993, Sin and T'sou, 1994, Lun 
et al., 1995). With the support fl'om the Hong 
Kong Judiciary, we have developed a 
transcription system for converting stenograph 
code to Chinese characters. 
CAT has been widely used for English for 
many years and awlilable R~r Mandarin Chinese, 
but none has existed for Cantonese. Althongh 
Cantonese is a Chinese dialect, Cantonese and 
Mandarin differ considerably in terms of 
phonological struclure, phouotactics, word 
morphology, vocabulary and orthogral)hy. Mutual 
intelligibility between the two dialects is generally 
very low. For example, while Cantonese has lnole 
than 700 distinct syllables, Mandarin has only 
about 400. Cantonese has 6 tone contours and 
Mandarin only 4. As for vocabulary, 16.5% of the 
words in a 1 million character corpus of court 
proceedings in Canlonese cannot be found in a 
corlms consisting of 30 million character 
newspaper texts in Modern Written Chinese 
(T'sou el al, 1997). For orthography, Mainhmd 
China uses the Simplified Chinese character set, 
and Hong Kong uses the Traditional set l~lus 
4,702 special local Cantonese Chinese characters 
(Hong Kong Government, 1999). Such 
differences between Cantonese and Mandarin 
necessitate the Jtnilinguistic Engineering 
undertaking to develop an independent Cantonese 
CAT system for the local  environment. 
The major challenge in developing a 
Cantonese CAT system lies in the conversion of 
phonologically-based stenograph code into 
Chinese text. Chinese is a logographic . 
Each character or logograph represents a syllable. 
While the total inventory of Cantonese syllable 
types is about 720, them am at least 14,000 
Chinese character types. The limited syllabary 
creates many homophones in the  (T'sou, 
1976). In a one million character corlms of court 
proceedings, 565 distinct syllable types were 
found, representing 2,922 distinct character types. 
Of the 565 syllable types, 470 have 2 or morn 
homophonous characters. In the extreme case, zi 
represents 35 homophonous character types. 
1121 
Coverage of l lomophonous and Non-homophous Characters 
100.00% ¢'o o~ 4:~ ~ ~os 4:e 
80.00% 6 
60.00% 
# 40.00% 
c~ 20.00% 
0.00% 
• No. of High Freq. Characters %~ t~ p 
\[EJ HoInophonous Charactcrs \[\] Non-honaophonous chaliactcrs\] 
Figure I. Covcragc of Homonymous Characters 
These 470 syllables represent 2,810 homophonous 
character types which account for 94.7% of the 
text, as shown in Figure \]. The homocode 
problem nmst be properly resolved to ensure 
successful conversion. 
2. Computer-Aided Transcription (CAT) 
~dJl 
t 
• I co 2F2L . 
Stcnoma~h code I i 
Slzg¢2 E 
I Transcription \[ | / ! 
EIt°illC ; | 
h Trigranl ,, Big,an, i ~ '9 
=, E ) 
! 
Proofreading 
\[ i ' I 
' Bigram / Trip, ram 
i Statistical Dala 
Chinese Text ill:~! 
Figure 2. Automatic Transcription Process 
Figure 2 outlines the transcription process in the 
Cantonese CAT system. Following typical 
courtroom CAT systems, our process is divided 
into three major stages. In Stage 1, simultaneous 
to a litigant speaking, a stenographer inputs 
speech, i.e. a sequence of transcribed syllables or 
stenograph codes, via a stenograph code generator. 
Each stenograph code basically stands for a 
syllable. In Stage 2, the transcription software 
converts the sequence of stenograph codes \[Sl ..... 
s,,} into the original character text {q ..... c,,}. 
This procedure requires the conversion 
component to be tightly bound to the phonology 
and orthography of a specific . To 
specifically address homonymy in Cantonese, the 
conversion procedure in our system is supported 
by bigram and trigram statistical data derived 
from domain-specific training. In Stage 3, manual 
editing of the transcribed texts corrects errors 
from typing mistakes or his-transcription. 
3. System Architecture 
3.1 Statistical Formulation 
To resolve massive ambiguity in speech to text 
conversion, the N-gram model is used to 
determine the most probable character sequence 
{q ..... ck} given the input stenograph code 
sequence {s~ ..... Sk}. The conditional probability 
(1) is to be maximized. 
(1) P(q ..... c~l sl ..... sk) 
where {q ..... c~} stands for a sequence of N 
characters, and {sl ..... sk} for a sequence of k 
input stenograph codes. 
The co-occurrence frequencies necessary for 
computation are acquired through training. 
However, a huge amount of data is needed to 
generate reliable statistical estimates for (1) if 
N > 3. Consequently, N-gram probability is 
approximated by bigram or trigram estimates. 
First, rewrite (1) as (2) using Bayes' rule. 
P(c, ..... c )xP(s, ..... s lc, ..... c,) (2) 
P(s, ..... s k ) 
As the value of P(s I ..... st) remains unchanged 
for any choice of {q ..... ct}, one needs only to 
maximize the numerator in (2), i.e. (3). 
(3) P(cl ..... Ck) X P(sl ..... s,\[cl ..... ck) 
(3) can then be approximated by (4) or (5) using 
bigram and trigram models respectively. 
(4) FL=,...., (P(c,lq_,) x P(s,.Iq)) 
(5) ×P(silci)) 
The transcription program is to compute the best 
sequence of {q ..... c,} so as to maximize (4) or 
(5). The advantage of the approximations in (4) 
and (5) is that P(s,lc,), P(c,lc,.,) and P(c,lc,_2c,_,) 
can be readily estimated using a training corpus of 
manageable size. 
3.2 Viterbi Algorithm 
The Viterbi algorithm (Viterbi, 1967) is 
implemented to efficiently compute the maxinmm 
value of (4) and (5) for different choices of 
1122 
character sequences. Instead o1' exhaustively 
computing tile values for all possible character 
sequences, the algorithm only keeps track of the 
probability of the best character sequence 
terminating in each possible character candidate 
for a stenograph code. 
In the trigram implelnentatiou, size limitation 
in the training cortms makes it impossible to 
estimate all possible P(cilci_2ci.i) because some 
{ci_2, ci_l, q} may never occur there. Following 
Jelinek (1990), P(cil ci.2ci_ i ) is approximated by 
the summation of weighted lrigram, bigram and 
unigram estimates in (6). 
(6) p(c i I ci_ 2 ci_ 1) 
f(ci_ 2 ci- I ci ) f(ci_ 1 c i ) f(c i ) = w 3 
X + w 2 X + w 1 X- 
f(ci_ 2 ci_ 1 ) f(ci_ I ) Z f(cj ) 
where (i) w,, w2, w-s _> 0 are weights, (ii) 
wl-l-w2-{-H; 3 = 1, and (iii) Z f(q) is the stun of 
frequencies of all characters. Typically lhe best 
results can be obtained if w:~, the weight for 
trigram, is significantly greater than the olher two 
weights so that the trigram probability has 
dominant effect in the probability expression. In 
our tests, we sot wl=0.01, w2=0.09, aud u;3=0.9. 
The Viterbi algorithm substantially reduces the 
computational complexity flom O(m") to O(m.-~n) 
and O(nr~n) using bigram and trigram estimation 
rc:spectively where n is the number of stenograph 
code tokens in a sentence, and m is tile upper 
bound of the number of homophonous characters 
for a stenograph code. 
To maximize the transcription accuracy, we 
also refine the training corpus to ensure that the 
bigram and trigram statistical models reflect the 
comtroom lauguage closely. This is done by 
enlarging tile size of tile training corpus and by 
compiling domain-specific text corpora. 
3.,3 Special Encoding 
After some initial trial tests, error analysis was 
conducted to investigate the causes of the mis- 
transcribed characters. It showed that a noticeable 
amount of errors were due to high failure rate in 
the mtriewtl of seine characters in the 
transcription. The main reason is that high 
fiequency characters are more likely to interfere 
with the correct retrieval of other relatively lower 
frequency homophouous characters. For example, 
Cantonese, hal ('to be') and hal ('at') are 
homophouous in terms of seglnental makeup. 
Their absolute fiequcucies in our training corpus 
are 8,695 and 1,614 respectively. Because of the 
large fi'equency discrepancy, the latter was mis- 
transcribed as tile former 44% of the times in a 
trial test. 32 such high fi'equency characters were 
found to contribute to about 25% of all 
transcription errors. To minimize the interference, 
special encoding, which resulted flom shallow 
linguistic processing, is applied to the 32 
characters so that each of them is assigned a 
unique stenograph code. This was readily 
accepted by the court stenographers. 
4. hnplementation and Results 
4.1 Compilation of Corpora 
In our expreriments, authentic Chinese court 
proceedings from the Hong Kong Judiciary were 
used fox tile compilation of the training and 
testing corpora for the CAT prototypes. To ensure 
that tile training data is comparable with tile data 
to be transcribed, the training corpus should be 
large enough to obtain reliable estimates for 
P(silc,.), P(cilci j) and P(cilci_2ci_l). in our trials, 
we quickly approached the point of diminishing 
return when the size of the training corpus reaches 
about 0.85 million characters. (See Section 4.2.2.) 
To further enhance training, the system also 
exploited stylistic and lexical variations across 
different legal domains, e.g. tra\[.'fic, assauh, and 
fraud offences. Since different case types show 
distinct domain-specific legal vocabulary or usage, 
simply integrating all texts in a single training 
corpus may obscure the characteristics o1' specific 
 domains, thus degrading the modelling. 
Hence domain-specific training corpora were also 
compiled to enhance performance. 
Two sets of data were created for testing and 
comparison: Generic Coqms (GC) and Domain- 
,specific Cmpus (DC). Whereas GC consists of 
texts representing various legal case types, DC is 
restricted to traffic offence cases. Each set 
consists of a training corpus of 0.85 million 
characters and a testing corpus of 0.2 million 
characters. The training corpus consists of 
Chinese characters along with the corresponding 
stenograph codes, and tile testing corpus consists 
solely of stenograph codes of the Chinese texts. 
4.2 Experimental Results 
For ewfluation, several prototypes were set up to 
1123 
test how different factors affected transcription 
accuracy. They included (i) use of bigram vs. 
trigram models, (ii) the size of the training 
corpora, (iii) domain-specific training, and (iv) 
special encoding. To measure conversion 
accuracy, the output text was compared with the 
original Chinese text in each test on a character by 
character basis, and the percentage of correctly 
transcribed characters was computed. Five sets of 
experiments are reported below. 
4.2.1 Bigram vs. Trigram 
Three prototypes were developed: the Bigram 
Prototype, CA Tva2, the Trigram Prototype, CA Tva.~, 
and the Baseline Prototype, CATo. CATva2 and 
CATvA.~ implelnent the conversion engines using 
the bigram and trigram Viterbi algorithm 
respectively. CA7o, was set up to serve as an 
experimental control. Instead of implementing the 
N-gram model, conversion is accomplished by 
selecting the highest fiequency item out of the 
homophonous character set for each stenograph 
code. GC was used throughout the three 
experiments. The training and testing data sets are 
0.85 and 0.20 million characters respectively. The 
results are summarized in Table 1. 
Corpus GC GC GC 
Accuracy 78.0% 92.4% 93.6% 
Table 1. Different N-gram Models 
The application of the bigram and trigram models 
offers about 14% and 15% improvement in 
accuracy over Control Prototype, CATo. 
4.2.2 Size of Training Corpora 
In this set of tests, the size of the training corpora 
was varied to determine the impact of the training 
corpus size on accuracy. The sizes tested are 0.20, 
0.35, 0.50, 0.63, 0.73 and 0.85 million characters. 
Each corpus is a proper subset of the immediately 
larger corpus so as to ensure the comparability of 
he trainin texts. CATvA 2 was used in the tests. 
Training Corpus GC GC GC 
Accurac~ 89.5% 91.2% 91.8% 
Training Corpus GC GC GC 
Accuracy 92.1% 92.3 % 92.4 % 
Table 2. Variable Training Data Size 
The results in Table 2 show that increasing the 
size of the training corpus enhances the accuracy 
incrementally. However, the point of diminishing 
return is reached when the size reaches 0.85 
million characters. We also tried doubling the 
corpus size to 1.50 million characters. It only 
yields 0.8% gain over the 0.85 million character 
corpus. 
4.2.3 Use of Domain-specific Training 
This set of tests evaluates the effectiveness of 
domain-specific training. Data fi'oln the two 
corpora, GC and DC, are utilized in the training of 
the bigram and trigram prototypes. The size of 
each training set is 0.85 million characters. The 
same set of 0.2 million character testing data from 
DC is used in all four conversion tests. Without 
increasing the size of the training data, setups with 
domain-specific training consistently yield about 
2% improvement. A more comprehensive set of 
corpora including Tra.lfic, Assault, and Robbeo~ is 
bein )iled and will be re )ortcd in future. 
Prototypes 
Training Data 
CATvA; CATvA3 CATvaz CATvA3 
GC GC DC DC 
Testing Data DC DC DC DC 
Accuracy 92.6% 92.8% 94.7% 94.8% 
Table 3. Application of Domain-Specificity 
4.2.4 Special Encoding 
Following shallow linguistic processing, special 
encoding assigns unique codes to 32 characters to 
reduce confusion with other characters. Another 
round of tests was repeated, identical to the 
CATvA2 and CATvA 3 tests in Section 4.2.1, except 
for the use of special encoding. The use of 
training and testing corpora have 0.85 and 0.20 
million characters respective 
S ~i~ I;:~ :::NOfA~I~Ii~: 
Prototypes CATw~ CATvA~ CATw2 CATvA3 
Corpus GC GC GC GC 
Accuracy 92.4% 93.6% 94.7% 95.6% 
Table 4. Application of Special Encoding 
Table 4 shows that the addition of special 
encoding consistently offers about 2% increase in 
accuracy. Special encoding and hence shallow 
linguistic processing provide the most significant 
improvement in accuracy. 
4.2.5 Incorporation of Domain-Specificity and 
Special Eneoding 
As discussed above, both domain-specific training 
and special encoding raise the accuracy of 
transcription. The last set of tests deals with the 
integration of the two features. Special encoding 
1124 
is utilized in the training and testing data of DC 
which have 0.85 and 0.20 million characters 
respectively. 
~raining/Testing, Data DC DC 
I S. Encoding Applied Applied / zcuracy 95.4 % 96.2 % 
Table 5. Integration of D. Specificity and S. Encoding 
Recall that Domain-Specificity and Special 
Encoding each offers 2% improvelnent. Table 5 
shows that combining BOTH features offer about 
3% improvement over tests without them. (See 
non-domain-specific tests in Section 4.2.3) 
The 96.2% accuracy achieved by CATvA 3 
represents the best performance of our system. 
The result is conaparable with other relevant 
advanced systems for speech to text conversion. 
For example, Lee (1999) reported 94% accuracy 
in a Chinese speech to text transcription system 
under developlnent with very large training 
corpus. 
5. Conclusion 
We have created a Cantonese Chinese CAT 
system which uses the phonologically-based 
stenograph machine. The system delivers 
encouragingly accurate transcription in a  
which has many hon\]ol~honous characters. To 
resolve problematic ambiguity in the conversion 
fi'on-i a I)honologically-based code to the 
logograt)hic Chinese characters, we made use of 
lhe N-gram statistical model. The Viterbi 
algorithm has enabled us to identify the most 
probable sequence of characters from the sels of 
possible homophonous characters. With the 
additional use of special encoding and domain- 
specific training, the Cantonese CAT system has 
attained 96% transcription accuracy. The success 
of the Jurilinguistic Engineering project can 
further enhance the efforts by the Hong Kong 
Judiciary to conduct trials in the  of the 
majority population. Further improvement to the 
system will include (i) more domain-specific 
training and testing across different case types, (2) 
firm-tuning for the optimal weights in the trigram 
formula, and (3) optilnizing the balance between 
training corpus size and shallow linguistic 
processing. 
Acknowledgement 
Support for the research reported here is provided 
mainly through the Research Grants Council of 
Hong Kong under Competitive Earmarked 
Research Grant (CERG) No. 9040326. 

References 

Hong Kong Govennnent. 1999. Hong Kong 
Szq~plemenmry Character Set. hfformation 
Technology Services Department & Official 
Languages Agency. 

Jelinek, F. 1990. "Self-organized Language Modeling 
lbr Speech Recognition." In A. Waibel and K.F. 
Lee, (eds.). Readings in Speech Recognition. San 
Mateo: CA: Morgan Kaufmann Publishers. 

Lee, K. F. 1999. "Towards a Multimedia, Multimodal, 
Multilingual Computer." Paper presented on behall' 
of Microsoft Research Institute, China in the 5th 
Natural Language Processing Pacil'ic Rim 
Symposium held in Belling, China, November 5-7, 
1999. 

Lun, S., K. K. Sin, B. K. T'sou and T. A. Cheng. 1997. 
"l)iannao Fuzhu Yueyu Suii Fangan." (The 
Cantonese Shorthand System for Computer-Aided 
Transcription) (in Chinese) Proceedings o.f the 5th 
lntermttional Confere,ce on Cantonese and Other 
Yue Dialects. B. H. Zhan (ed). Guangzhou: Jinan 
University Press. pp. 217--227. 

Sin, K. K. and B. K. T'sou. 1994. "Hong Kong 
Courtroon\] Language: Some Issues on Linguistics 
and Language Technology." Paper presented at lhe 
Third International Conference on Chinese 
Linguistics. Hong Kong. 

T'sou, B. K. 1976. "Homophony and Internal Change 
in Chinese." Computational Analyses of Asian and 
African Lauguages 3, 67--86. 

T'sou, t3. K. 1993. "Some Issues on Law and Language 
in the ltong Kong Special Administrative Region 
(HKSAR) of China." Language, Law attd Equality: 
Proceedings of the 3rd International Conference of 
the International A caden G of Language Law (IALL). 
K. Prinsloo et al. Pretoria (cds.): University of South 
Africa. pp. 314-331. 

T'sou, B. K., H. L. Lin, G. Liu, T. Chan, J. Hu, C. H. 
Chew, and J. K. P. Tse. 1997. "A Synchronous 
Chinese Language Corpus fi'om Different Speech 
Communities: Construction and Applications." 
Computational Linguistics and Chinese Language 
Ptwcessing 2:91-- 104. 

Viterbi, A. J. 1967. "Error Bounds for Convolution 
Codes and an Asymptotically Optimal l)ecoding 
Algorithm." IEEE 7'ransactions on htformation 
TheoG 13: 260--269. 
