Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 1137–1144,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Punjabi Machine Transliteration 
 
M. G. Abbas Malik 
Department of Linguistics 
Denis Diderot, University of Paris 7 
Paris, France 
abbas.malik@gmail.com 
 
  
 
Abstract 
Machine transliteration transcribes a word written in one script into
the script of another language with approximate phonetic equivalence.
It is useful for machine translation, cross-lingual information
retrieval, and multilingual text and speech processing. Punjabi
Machine Transliteration (PMT) is a special case of machine
transliteration: it converts a word from Shahmukhi (based on the
Arabic script) to Gurmukhi (derived from Landa, Sharda and Takri, old
scripts of the Indian subcontinent), the two scripts of Punjabi,
irrespective of the type of word.
The Punjabi Machine Transliteration 
System uses transliteration rules (charac-
ter mappings and dependency rules) for 
transliteration of Shahmukhi words into 
Gurmukhi. The PMT system can translit-
erate every word written in Shahmukhi. 
1 Introduction 
Punjabi is the mother tongue of more than 110 million people in
Pakistan (66 million) and India (44 million), and of many millions
more in America, Canada and Europe. It has been written for centuries
in two mutually incomprehensible scripts, Shahmukhi and Gurmukhi.
Punjabis from Pakistan are unable to read Punjabi written in
Gurmukhi, and Punjabis from India are unable to read Punjabi written
in Shahmukhi. In contrast, they have no difficulty understanding each
other's speech. The Punjabi Machine Transliteration (PMT) system is an
effort to bridge this written communication gap between the two
scripts for the benefit of the millions of Punjabis around the globe.
Transliteration refers to phonetic translation across two languages
with different writing systems (Knight & Graehl, 1998), such as Arabic
to English (Nasreen & Leah, 2003). Most prior work on transliteration
has been done for Machine Translation (MT) (Knight & Graehl, 1997;
Paola & Sanjeev, 2003; Knight & Stalls, 1998) from English into other
major languages of the world such as Arabic and Chinese, for
cross-lingual information retrieval (Pirkola et al., 2003), for the
development of multilingual resources (Yan et al., 2003; Kang & Kim,
2000), and for the development of cross-lingual applications.
PMT is a special kind of machine transliteration. It converts a
Shahmukhi word into a Gurmukhi word irrespective of the type of the
word. It preserves not only the phonetics of the transliterated word
but, in contrast to usual transliteration, also its meaning.
The two scripts are first discussed and compared. Based on this
comparison and analysis, character mappings between Shahmukhi and
Gurmukhi are drawn and transliteration rules are discussed. Finally,
the architecture and process of the PMT system are described. The
system was applied to Unicode-encoded Punjabi text prepared
specifically for testing, and the results were compiled and analyzed.
The PMT system will provide a basis for Cross-Scriptural Information
Retrieval (CSIR) and Cross-Scriptural Application Development (CSAD).
2 Punjabi Machine Transliteration 
According to Paola (2003), “When writing a for-
eign name in one’s native language, one tries to 
preserve the way it sounds, i.e. one uses an or-
thographic representation which, when read 
aloud by the native speaker of the language, 
sounds as it would when spoken by a speaker of 
the foreign language – a process referred to as 
Transliteration”. Usually, transliteration refers to the phonetic
translation of a word of some specific type (proper nouns, technical
terms, etc.) across languages with different writing systems. Native
speakers may not understand the meaning of the transliterated word.
PMT is a special type of machine transliteration in which a word is
transliterated across two different writing systems used for the same
language. It is independent of the type of the word, and it preserves
both the phonetics and the meaning of the transliterated word.
3 Scripts of Punjabi 
3.1 Shahmukhi 
Shahmukhi derives its character set from the Arabic alphabet. It is a
right-to-left script, and the shape assumed by a character in a word
is context-sensitive, i.e. the shape differs depending on whether the
character is at the beginning, in the middle or at the end of the
word. Normally, it is written in Nastalique, a highly complex writing
style that is cursive and context-sensitive. A sentence illustrating
Shahmukhi is given below:
پنجابی میری ماڻ جوگی ماں بولی اے۔
It has 49 consonants, 16 diacritical marks and 16 vowels, etc.
(Malik, 2005).
3.2 Gurmukhi 
Gurmukhi derives its character set from old scripts of the Indian
subcontinent, i.e. Landa (script of the North West), Sharda (script of
Kashmir) and Takri (script of the western Himalaya). It is a
left-to-right syllabic script. A sentence illustrating Gurmukhi is
given below:
ਪੰਜਾਬੀ ਮੇਰੀ ਮਾਣ ਜੋਗੀ ਮਾਂ ਬੋਲੀ ਏ.
It has 38 consonants, 10 vowel characters, 9 vowel symbols, 2 symbols
for nasal sounds and 1 symbol that doubles the sound of a consonant
(Bhatia, 2003; Malik, 2005).
4 Analysis and PMT Rules 
Punjabi is written in two completely different scripts. One script is
right-to-left and the other is left-to-right. One is an Arabic-based
cursive script and the other is syllabic. Both of them, however,
represent the same phonetic repository of Punjabi. These sounds are
used to determine the relation between the characters of the two
scripts, and the character mappings are determined on this basis.
For the analysis and comparison, both scripts are subdivided into
groups on the basis of character type, e.g. consonants, vowels,
diacritical marks, etc.
4.1 Consonant Mapping 
Consonants can be further subdivided into two 
groups: 
Aspirated Consonants: There are sixteen aspirated consonants in
Punjabi (Malik, 2005). Ten of them (بھ [bʰ], پھ [pʰ], تھ [ṱʰ],
ٹھ [ʈʰ], جھ [ʤʰ], چھ [ʧʰ], دھ [ḓʰ], ڈھ [ɖʰ], کھ [kʰ], گھ [gʰ]) are
used far more frequently in Punjabi than the remaining six (رھ [rʰ],
ڑھ [ɽʰ], لھ [lʰ], مھ [mʰ], نھ [nʰ], وھ [vʰ]). In Shahmukhi, an
aspirated consonant is represented by the combination of the
consonant to be aspirated and HEH-DOACHASHMEE (ھ). For example,
ب [b] + ھ [h] = بھ [bʰ] and ج [ʤ] + ھ [h] = جھ [ʤʰ].
 
In Gurmukhi, each frequently used aspirated consonant is represented
by a unique character. The less frequent aspirated consonants are
represented by the combination of the consonant to be aspirated and
the sub-joined PAIREEN HAAHAA, e.g. ਲ [l] + ◌੍ + ਹ [h] = ਲ੍ਹ (لھ) [lʰ]
and ਵ [v] + ◌੍ + ਹ [h] = ਵ੍ਹ (وھ) [vʰ], where ◌੍ is the sub-joiner.
The sub-joiner character (◌੍) signals that the following ਹ [h] takes
its sub-joined shape, PAIREEN HAAHAA.
The mapping of the ten frequently used aspirated consonants is given
in Table 1.
Sr.  Shahmukhi   Gurmukhi    Sr.  Shahmukhi   Gurmukhi
1    بھ [bʰ]     ਭ           6    چھ [ʧʰ]     ਛ
2    پھ [pʰ]     ਫ           7    دھ [ḓʰ]     ਧ
3    تھ [ṱʰ]     ਥ           8    ڈھ [ɖʰ]     ਢ
4    ٹھ [ʈʰ]     ਠ           9    کھ [kʰ]     ਖ
5    جھ [ʤʰ]     ਝ           10   گھ [gʰ]     ਘ
Table 1: Aspirated Consonants Mapping
The mapping for the remaining six aspirates is 
covered under non-aspirated consonants. 
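The lookup described by Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ASPIRATED` and `map_aspirate` are hypothetical, and the Unicode code points are the standard ones for these letters rather than anything specified in the paper.

```python
# Hypothetical lookup for the ten frequent aspirates of Table 1.
# Keys are the base Shahmukhi consonants; the digraph is base + HEH-DOACHASHMEE.
HEH_DOACHASHMEE = "\u06be"  # ھ

ASPIRATED = {
    "\u0628": "\u0a2d",  # بھ -> ਭ
    "\u067e": "\u0a2b",  # پھ -> ਫ
    "\u062a": "\u0a25",  # تھ -> ਥ
    "\u0679": "\u0a20",  # ٹھ -> ਠ
    "\u062c": "\u0a1d",  # جھ -> ਝ
    "\u0686": "\u0a1b",  # چھ -> ਛ
    "\u062f": "\u0a27",  # دھ -> ਧ
    "\u0688": "\u0a22",  # ڈھ -> ਢ
    "\u06a9": "\u0a16",  # کھ -> ਖ
    "\u06af": "\u0a18",  # گھ -> ਘ
}


def map_aspirate(base, following):
    """Return the Gurmukhi aspirate if base + following is a Table 1 digraph, else None."""
    if following == HEH_DOACHASHMEE:
        return ASPIRATED.get(base)
    return None
```

A converter built this way would need to look one character ahead before falling back to the single-consonant mapping.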
Non-Aspirated Consonants: In the case of non-aspirated consonants,
Shahmukhi has more consonants than Gurmukhi, which follows the
principle of one symbol for one sound. Shahmukhi, on the other hand,
has more than one character for some sounds. For example, Seh (ث),
Seen (س) and Sad (ص) all represent [s], which has a single equivalent
in Gurmukhi, Sassaa (ਸ). Similarly, other characters such as ਅ [a],
ਤ [ṱ], ਹ [h] and ਜ਼ [z] have multiple equivalents in Shahmukhi. The
non-aspirated consonant mapping is given in Table 2.
Sr.  Shahmukhi  Gurmukhi    Sr.  Shahmukhi  Gurmukhi
1    ب [b]      ਬ           21   ط [ṱ]      ਤ
2    پ [p]      ਪ           22   ظ [z]      ਜ਼
3    ت [ṱ]      ਤ           23   ع [ʔ]      ਅ
4    ٹ [ʈ]      ਟ           24   غ [ɤ]      ਗ਼
5    ث [s]      ਸ           25   ف [f]      ਫ਼
6    ج [ʤ]      ਜ           26   ق [q]      ਕ਼
7    چ [ʧ]      ਚ           27   ک [k]      ਕ
8    ح [h]      ਹ           28   گ [g]      ਗ
9    خ [x]      ਖ਼           29   ل [l]      ਲ
10   د [ḓ]      ਦ           30   لؕ [ɭ]      ਲ਼
11   ڈ [ɖ]      ਡ           31   م [m]      ਮ
12   ذ [z]      ਜ਼           32   ن [n]      ਨ
13   ر [r]      ਰ           33   ڻ [ɳ]      ਣ
14   ڑ [ɽ]      ੜ           34   ں [ŋ]      ◌ਂ
15   ز [z]      ਜ਼           35   و [v]      ਵ
16   ژ [ʒ]      ਜ਼           36   ہ [h]      ਹ
17   س [s]      ਸ           37   ھ [h]      ◌੍ਹ
18   ش [ʃ]      ਸ਼           38   ی [j]      ਯ
19   ص [s]      ਸ           39   ے [j]      ਯ
20   ض [z]      ਜ਼
Table 2: Non-Aspirated Consonants Mapping
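The many-to-one character of this mapping can be made concrete with a small sketch. It covers only the [s], [z] and [ṱ] groups from Table 2; the names `NON_ASPIRATED` and `sources_of` are hypothetical and the code points are standard Unicode values assumed for illustration.

```python
# A subset of Table 2 showing that several Shahmukhi letters share one Gurmukhi letter.
NON_ASPIRATED = {
    "\u062b": "\u0a38",  # ث -> ਸ
    "\u0633": "\u0a38",  # س -> ਸ
    "\u0635": "\u0a38",  # ص -> ਸ
    "\u0630": "\u0a5b",  # ذ -> ਜ਼
    "\u0632": "\u0a5b",  # ز -> ਜ਼
    "\u0636": "\u0a5b",  # ض -> ਜ਼
    "\u0638": "\u0a5b",  # ظ -> ਜ਼
    "\u062a": "\u0a24",  # ت -> ਤ
    "\u0637": "\u0a24",  # ط -> ਤ
}


def sources_of(gurmukhi_char):
    """List every Shahmukhi letter in the subset that maps to the given Gurmukhi letter."""
    return sorted(s for s, g in NON_ASPIRATED.items() if g == gurmukhi_char)
```

Because `sources_of` returns more than one letter for ਸ, ਜ਼ and ਤ, a plain dictionary lookup is enough in the Shahmukhi-to-Gurmukhi direction, while the reverse direction would need extra information to choose among the candidates.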
4.2 Vowel Mapping 
Punjabi contains ten vowels. In Shahmukhi, these vowels are
represented with the help of four long vowels (Alef Madda (آ),
Alef (ا), Vav (و) and Choti Yeh (ی)) and three short vowels (Arabic
Fatha – Zabar (◌َ), Arabic Damma – Pesh (◌ُ) and Arabic Kasra –
Zer (◌ِ)). Note that the last two long vowels are also used as
consonants.
Hamza (ء) is a special character that always comes between two vowel
sounds as a placeholder. For example, in آسائش [ɑsɑɪʃ] (comfort),
Hamza (ء) separates the two vowel sounds Alef (ا) and Zer (◌ِ), and in
آؤ [ɑo] (come), Hamza (ء) separates the two vowel sounds Alef
Madda (آ) [ɑ] and Vav (و) [o]. In the first example, the Zer (◌ِ)
after Hamza is normally dropped in everyday writing, so Hamza (ء) is
mapped to ਇ [ɪ] when it is followed by a consonant.
In Gurmukhi, vowels are represented by ten 
independent vowel characters (ਅ, ਆ, ਇ, ਈ, ਉ, 
ਊ, ਏ, ਐ, ਓ, ਔ) and nine dependent vowel signs 
(◌ਾ, ਿ◌, ◌ੀ, ◌ੁ, ◌ੂ, ◌ੇ, ◌ੈ, ◌ੋ, ◌ੌ). When a vowel 
sound comes at the start of a word or is inde-
pendent of some consonant in the middle or end 
of a word, independent vowels are used; other-
wise dependent vowel signs are used. The analy-
sis of vowels is shown in Table 4 and the vowel 
mapping is given in Table 3. 
Sr.  Shahmukhi   Gurmukhi    Sr.  Shahmukhi   Gurmukhi
1    اَ [ə]       ਅ           11   ا [ə]       ਅ, ◌ਾ
2    آ [ɑ]       ਆ           12   ◌ِ [ɪ]       ◌ਿ
3    اِ [ɪ]       ਇ           13   ◌ِی [i]      ◌ੀ
4    اِی [i]      ਈ           14   ◌ُ [ʊ]       ◌ੁ
5    اُ [ʊ]       ਉ           15   ◌ُو [u]      ◌ੂ
6    اُو [u]      ਊ           16   ے [e]       ◌ੇ
7    اے [e]      ਏ           17   ◌َے [æ]      ◌ੈ
8    اَے [æ]      ਐ           18   و [o]       ◌ੋ
9    او [o]      ਓ           19   ◌َو [ɔ]      ◌ੌ
10   اَو [ɔ]      ਔ           20   ء [ɪ]       ਇ
Table 3: Vowels Mapping
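The choice between an independent vowel character and a dependent vowel sign can be sketched as follows. The table keyed by IPA symbol and the function name `vowel_form` are illustrative assumptions; the only rule encoded is the one stated above, namely that a vowel attached to a preceding consonant takes the dependent sign and any other vowel takes the independent character.

```python
# (independent character, dependent sign) for each of the ten vowels.
# The inherent vowel [ə] has no dependent sign, so its second entry is empty.
VOWELS = {
    "ə": ("\u0a05", ""),        # ਅ
    "ɑ": ("\u0a06", "\u0a3e"),  # ਆ, ਾ
    "ɪ": ("\u0a07", "\u0a3f"),  # ਇ, ਿ
    "i": ("\u0a08", "\u0a40"),  # ਈ, ੀ
    "ʊ": ("\u0a09", "\u0a41"),  # ਉ, ੁ
    "u": ("\u0a0a", "\u0a42"),  # ਊ, ੂ
    "e": ("\u0a0f", "\u0a47"),  # ਏ, ੇ
    "æ": ("\u0a10", "\u0a48"),  # ਐ, ੈ
    "o": ("\u0a13", "\u0a4b"),  # ਓ, ੋ
    "ɔ": ("\u0a14", "\u0a4c"),  # ਔ, ੌ
}


def vowel_form(vowel, after_consonant):
    """Pick the dependent sign after a consonant, otherwise the independent character."""
    independent, dependent = VOWELS[vowel]
    return dependent if after_consonant else independent


# ਮੇਰਾ [merɑ]: both vowels follow a consonant, so both are dependent signs.
mera = "\u0a2e" + vowel_form("e", True) + "\u0a30" + vowel_form("ɑ", True)
```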
 
 
Vowel: ɑ
  Shahmukhi: Represented by Alef Madda (آ) in the beginning of a word and by Alef (ا) in the middle or at the end of a word.
  Gurmukhi: Represented by ਆ and ◌ਾ
  Examples: آدمی → ਆਦਮੀ [ɑdmi] (man); جاوڻا → ਜਾਵਣਾ [ʤɑvɳɑ] (go)
Vowel: ə
  Shahmukhi: Represented by Alef (ا) in the beginning of a word and with Zabar (◌َ) elsewhere.
  Gurmukhi: Represented by ਅ in the beginning.
  Examples: اجّ → ਅੱਜ [ɑʤʤ] (today)
Vowel: e
  Shahmukhi: Represented by the combination of Alef (ا) and Choti Yeh (ی) in the beginning; a consonant and Choti Yeh (ی) in the middle; and a consonant and Baree Yeh (ے) at the end of a word.
  Gurmukhi: Represented by ਏ and ◌ੇ
  Examples: ایدھر → ਏਧਰ [eḓʰər] (here); میرا → ਮੇਰਾ [merɑ] (mine); سارے → ਸਾਰੇ [sɑre] (all)
Vowel: æ
  Shahmukhi: Represented by the combination of Alef (ا), Zabar (◌َ) and Choti Yeh (ی) in the beginning; a consonant, Zabar (◌َ) and Choti Yeh (ی) in the middle; and a consonant, Zabar (◌َ) and Baree Yeh (ے) at the end of a word.
  Gurmukhi: Represented by ਐ and ◌ੈ
  Examples: اَےہ → ਐਹ [æh] (this); مَیل → ਮੈਲ [mæl] (dirt); ہَے → ਹੈ [hæ] (is)
Vowel: ɪ
  Shahmukhi: Represented by the combination of Alef (ا) and Zer (◌ِ) in the beginning and a consonant and Zer (◌ِ) in the middle of a word. It never appears at the end of a word.
  Gurmukhi: Represented by ਇ and ◌ਿ
  Examples: اِکّو → ਇੱਕੋ [ɪkko] (one); بارِش → ਬਾਰਿਸ਼ [bɑrɪsh] (rain)
Vowel: i
  Shahmukhi: Represented by the combination of Alef (ا), Zer (◌ِ) and Choti Yeh (ی) in the beginning; a consonant, Zer (◌ِ) and Choti Yeh (ی) in the middle; and a consonant and Choti Yeh (ی) at the end of a word.
  Gurmukhi: Represented by ਈ and ◌ੀ
  Examples: اِیتر → ਈਤਰ [iṱər] (mean); امِیری → ਅਮੀਰੀ [ɑmiri] (richness); پنجابی → ਪੰਜਾਬੀ [pənʤɑbi] (Punjabi)
Vowel: ʊ
  Shahmukhi: Represented by the combination of Alef (ا) and Pesh (◌ُ) in the beginning and a consonant and Pesh (◌ُ) in the middle of a word. It never appears at the end of a word.
  Gurmukhi: Represented by ਉ and ◌ੁ
  Examples: اُدّھر → ਉੱਧਰ [ʊḓḓhr] (there); مُلّ → ਮੁੱਲ [mʊll] (price)
Vowel: u
  Shahmukhi: Represented by the combination of Alef (ا), Pesh (◌ُ) and Vav (و) in the beginning; a consonant, Pesh (◌ُ) and Vav (و) in the middle and at the end of a word.
  Gurmukhi: Represented by ਊ and ◌ੂ
  Examples: اُردُو → ਉਰਦੂ [ʊrḓu]; صُورت → ਸੂਰਤ [surṱ] (face)
Vowel: o
  Shahmukhi: Represented by the combination of Alef (ا) and Vav (و) in the beginning; a consonant and Vav (و) in the middle and at the end of a word.
  Gurmukhi: Represented by ਓ and ◌ੋ
  Examples: اوچھاڑ → ਓਛਾੜ [oʧhɑɽ] (cover); پڑھولا → ਪੜ੍ਹੋਲਾ [pɽholɑ] (a big pot in which wheat is stored)
Vowel: ɔ
  Shahmukhi: Represented by the combination of Alef (ا), Zabar (◌َ) and Vav (و) in the beginning; a consonant, Zabar (◌َ) and Vav (و) in the middle and at the end of a word.
  Gurmukhi: Represented by ਔ and ◌ੌ
  Examples: اَوڑا → ਔੜਾ [ɔɽɑ] (hindrance); مَوت → ਮੌਤ [mɔṱ] (death)
Note: → means ‘its equivalent in Gurmukhi is’.
Table 4: Vowels Analysis of Punjabi for PMT
4.3 Sub-Joins (PAIREEN) of Gurmukhi 
There are three PAIREEN (sub-joins) in Gurmukhi, “Haahaa”, “Vaavaa”
and “Raaraa”, shown in Table 5. For PMT, if HEH-DOACHASHMEE (ھ) comes
after a consonant whose aspirated form is one of the six less
frequently used aspirates, it is transliterated into PAIREEN Haahaa.
The other PAIREEN are very rare and are used only in Sanskrit loan
words. In present-day writing, PAIREEN Vaavaa and Raaraa are being
replaced by normal Vaavaa (ਵ) and Raaraa (ਰ) respectively.
Sr.  PAIREEN        Shahmukhi   Gurmukhi   English
1    ◌੍ਹ (Haahaa)    بُلّھ        ਬੁੱਲ੍ਹ       Lips
2    ◌੍ਰ (Raaraa)    چندرما      ਚੰਦ੍ਰਮਾ      Moon
3    ◌੍ਵ (Vaavaa)    سوَیمان     ਸ੍ਵੈਮਾਨ     Self-respect
Table 5: Sub-joins (PAIREEN) of Gurmukhi
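For the six less frequent aspirates, the sub-joined form is simply the base consonant followed by the Gurmukhi sub-joiner and Haahaa. A minimal sketch, assuming standard Unicode code points and using the hypothetical names `LESS_FREQUENT` and `sub_joined_aspirate`:

```python
SUB_JOINER = "\u0a4d"  # ◌੍ (Gurmukhi virama)
HAAHAA = "\u0a39"      # ਹ

# Base consonants of the six less frequent aspirates and their Gurmukhi equivalents.
LESS_FREQUENT = {
    "\u0631": "\u0a30",  # ر -> ਰ
    "\u0691": "\u0a5c",  # ڑ -> ੜ
    "\u0644": "\u0a32",  # ل -> ਲ
    "\u0645": "\u0a2e",  # م -> ਮ
    "\u0646": "\u0a28",  # ن -> ਨ
    "\u0648": "\u0a35",  # و -> ਵ
}


def sub_joined_aspirate(base):
    """Build consonant + sub-joiner + Haahaa, which renders as PAIREEN Haahaa."""
    return LESS_FREQUENT[base] + SUB_JOINER + HAAHAA
```

For example, `sub_joined_aspirate("\u0644")` yields the three code points of ਲ੍ਹ, matching the composition given in Section 4.1.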
4.4 Diacritical Marks 
In both Shahmukhi and Gurmukhi, diacritical marks (dependent vowel
signs in Gurmukhi) are the backbone of the vowel system and are very
important for the correct pronunciation of a word and for
understanding its meaning. There are sixteen diacritical marks in
Shahmukhi and nine dependent vowel signs in Gurmukhi (Malik, 2005).
The mapping of diacritical marks is given in Table 6.
Sr. Shahmukhi Gurmukhi Sr. Shahmukhi Gurmukhi 
1 F◌  [ə ] --- 9 F◌  [ɪ n] ਿ◌ਨ 
2 G◌  [ɪ ] ਿ◌ 10 H◌  ◌ੱ 
3 E◌  [ʊ ] ◌ੁ 11 W◌  --- 
4 ؕ --- 12 Y◌  --- 
5 F◌  [ə n] ਨ 13 Y◌  --- 
6 E◌  [ʊ n] ◌ੂਨ 14 G◌  --- 
7 E◌  --- 15 
 
--- 
8 
 
--- 16 G◌  [ɑ] ◌ਾ 
Table 6: Diacritical Mapping 
Diacritical marks in Shahmukhi are very important for the correct
pronunciation of a word and for understanding its meaning, but they
are used sparingly in everyday writing. In the normal text of
Shahmukhi books, newspapers and magazines, one will not find the
diacritical marks. The pronunciation of a word and its meaning are
instead understood from the context in which it is used.
 
For example, 
E} FZ
null
 uuu ~ww ~hâa }ZX  
@null ð~  ~hâa }Z wiX 
In the first sentence, the word چوڑی is pronounced as [ʧɔɽi] and
means ‘wide’. In the second sentence, the same written word چوڑی is
pronounced as [ʧuɽi] and means ‘bangle’. To remove the ambiguity,
there should be a Zabar (◌َ) after Cheh (چ) in the first word and a
Pesh (◌ُ) after Cheh (چ) in the second.
It is clear from the above example that diacritical marks are
essential for removing ambiguities, for natural language processing
and for speech synthesis.
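The ambiguity can be reproduced in a short Python sketch. The two fully marked spellings below are reconstructed from the pronunciations given in the example (Zabar for ‘wide’, Pesh for ‘bangle’) and should be read as assumptions; `strip_marks` is a hypothetical helper that relies on the standard `unicodedata.combining` function.

```python
import unicodedata


def strip_marks(word):
    """Drop combining marks (Zabar, Zer, Pesh, Shadda, ...) as everyday writing does."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


wide = "\u0686\u064e\u0648\u0691\u06cc"    # چَوڑی with Zabar after Cheh
bangle = "\u0686\u064f\u0648\u0691\u06cc"  # چُوڑی with Pesh after Cheh

# Without diacritics the two words become the same character sequence.
same_when_unmarked = strip_marks(wide) == strip_marks(bangle)
```

Once the marks are removed, no character-level rule can tell the two words apart, which is why the input to a rule-based system needs them.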
4.5 Other Symbols 
Punctuation marks in Gurmukhi are the same as in English, except for
the full stop: DANDA (।) and double DANDA (॥) of the Devanagari
script are used instead. In Shahmukhi, punctuation marks are the same
as in Arabic. The mapping of digits and punctuation marks is given in
Table 7.
Sr.  Shahmukhi  Gurmukhi    Sr.  Shahmukhi  Gurmukhi
1    ۰          ੦           8    ۷          ੭
2    ۱          ੧           9    ۸          ੮
3    ۲          ੨           10   ۹          ੯
4    ۳          ੩           11   ،          ,
5    ۴          ੪           12   ؟          ?
6    ۵          ੫           13   ؛          ;
7    ۶          ੬           14   ۔          ।
Table 7: Other Symbols Mapping
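Since digits and punctuation carry no contextual dependencies, Table 7 can be sketched as a one-to-one translation table. The sketch assumes that Shahmukhi digits are encoded as Extended Arabic-Indic digits (U+06F0 to U+06F9); the names `SYMBOLS` and `map_symbols` are hypothetical.

```python
# One-to-one symbol mapping built with the standard str.maketrans / str.translate pair.
SYMBOLS = str.maketrans({
    # Digits 0-9: Extended Arabic-Indic -> Gurmukhi
    **{chr(0x06F0 + d): chr(0x0A66 + d) for d in range(10)},
    "\u060c": ",",       # ، -> ,
    "\u061f": "?",       # ؟ -> ?
    "\u061b": ";",       # ؛ -> ;
    "\u06d4": "\u0964",  # ۔ -> ।
})


def map_symbols(text):
    """Replace Shahmukhi digits and punctuation with their Gurmukhi equivalents."""
    return text.translate(SYMBOLS)
```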
4.6 Dependency Rules 
Character mappings alone are not sufficient for PMT. Certain
dependency (contextual) rules are also required to produce a correct
transliteration. The basic idea behind these rules is the same as
that of the character mappings. They include rules for aspirated
consonants, non-aspirated consonants, Alef (ا), Alef Madda (آ),
Vav (و), Choti Yeh (ی), etc. Only some of these rules are discussed
here due to space limitations.
Rules for Consonants: Shahmukhi consonants are transliterated into
their equivalent Gurmukhi consonants, e.g. س → ਸ [s]. Any diacritical
mark except Shadda (◌ّ) is ignored at this point and is treated in the
rules for vowels or in the rules for diacritical marks. In Shahmukhi,
Shadda (◌ّ) is placed after the consonant, but in Gurmukhi its
equivalent, Addak (◌ੱ), is placed before the consonant, e.g.
پ + ◌ّ → ◌ੱਪ [pp]. Both Shadda (◌ّ) and Addak (◌ੱ) double the sound of
the consonant after or before which they are placed.
This rule is applicable to all consonants in Tables 1 and 2 except
Ain (ع), Noon (ن), Noonghunna (ں), Vav (و), Heh Gol (ہ), Dochashmee
Heh (ھ), Choti Yeh (ی) and Baree Yeh (ے). These characters are treated
separately.
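The reordering of Shadda into Addak can be sketched as below. This is a simplified illustration, not the system's implementation: `apply_shadda` and `char_map` are hypothetical names, and characters missing from `char_map` are passed through unchanged.

```python
SHADDA = "\u0651"  # ◌ّ, written after the consonant in Shahmukhi
ADDAK = "\u0a71"   # ◌ੱ, written before the consonant in Gurmukhi


def apply_shadda(char_map, token):
    """Map characters one by one, moving each Shadda in front of the preceding output."""
    out = []
    for ch in token:
        if ch == SHADDA:
            if out:
                # Insert Addak before the consonant that was just emitted.
                out.insert(len(out) - 1, ADDAK)
        else:
            out.append(char_map.get(ch, ch))
    return "".join(out)


pp = apply_shadda({"\u067e": "\u0a2a"}, "\u067e\u0651")  # پّ -> ੱਪ
```

With a map that also contains Alef, the token اجّ comes out as ਅੱਜ, the form listed for this word in Table 4.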
Rule for Hamza (ء): Hamza (ء) is a special character of Shahmukhi.
The rules for Hamza (ء) are:
− If Hamza (ء) is followed by Choti Yeh (ی), then Hamza (ء) and Choti
Yeh (ی) will be transliterated into ਈ [i].
− If Hamza (ء) is followed by Baree Yeh (ے), then Hamza (ء) and Baree
Yeh (ے) will be transliterated into ਏ [e].
− If Hamza (ء) is followed by Zer (◌ِ), then Hamza (ء) and Zer (◌ِ)
will be transliterated into ਇ [ɪ].
− If Hamza (ء) is followed by Pesh (◌ُ), then Hamza (ء) and Pesh (◌ُ)
will be transliterated into ਉ [ʊ].
In all other cases, Hamza (ء) will be transliterated into ਇ [ɪ].
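The four Hamza cases and the default can be expressed as a single lookahead. The following is a sketch under the assumption that Hamza is encoded as the standalone character U+0621; `transliterate_hamza` is a hypothetical helper that also reports how many input characters it consumed.

```python
HAMZA = "\u0621"  # ء

# Character following Hamza -> Gurmukhi independent vowel for the pair.
HAMZA_RULES = {
    "\u06cc": "\u0a08",  # ء + ی -> ਈ
    "\u06d2": "\u0a0f",  # ء + ے -> ਏ
    "\u0650": "\u0a07",  # ء + Zer -> ਇ
    "\u064f": "\u0a09",  # ء + Pesh -> ਉ
}


def transliterate_hamza(token, i):
    """Return (gurmukhi, characters_consumed) for the Hamza at position i of token."""
    if token[i] != HAMZA:
        raise ValueError("position i does not hold a Hamza")
    following = token[i + 1] if i + 1 < len(token) else ""
    if following in HAMZA_RULES:
        return HAMZA_RULES[following], 2
    # Default case: Hamza alone becomes ਇ.
    return "\u0a07", 1
```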
5 PMT System 
5.1 System Architecture 
The architecture of the PMT system and its functionality are
described in this section. The system architecture of the Punjabi
Machine Transliteration System is shown in Figure 1.
Unicode-encoded Shahmukhi text input is received by the Input Text
Parser, which parses it into Shahmukhi words using simple parsing
techniques. These words are called Shahmukhi Tokens. The tokens are
then given to the Transliteration Component. This component passes
each token to the PMT Token Converter, which converts a Shahmukhi
Token into a Gurmukhi Token by using the PMT Rules Manager, which
consists of character mappings and dependency rules. The PMT Token
Converter then returns the Gurmukhi Token to the Transliteration
Component. When all Shahmukhi Tokens have been converted into
Gurmukhi Tokens, the Gurmukhi Tokens are passed to the Output Text
Generator, which generates the output Unicode-encoded Gurmukhi text.
The main PMT process is carried out by the PMT Token Converter and
the PMT Rules Manager.
Figure 1: Architecture of PMT System 
The PMT system is a rule-based transliteration system. It is robust,
fast and accurate, and it can be used in domains involving
Information and Communication Technology (web, WAP, instant
messaging, etc.).
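The data flow between the components can be sketched as a small pipeline. This is a toy outline only: the function names are hypothetical stand-ins for the components named above, tokenization is assumed to be whitespace splitting, and the rules manager is reduced to a plain character map with no dependency rules.

```python
from typing import Callable, Dict, List


def parse_tokens(text: str) -> List[str]:
    """Input Text Parser: split the input into whitespace-separated tokens."""
    return text.split()


def make_token_converter(mappings: Dict[str, str]) -> Callable[[str], str]:
    """PMT Token Converter backed by a toy rules manager holding only character mappings."""
    def convert(token: str) -> str:
        return "".join(mappings.get(ch, ch) for ch in token)
    return convert


def transliterate(text: str, convert: Callable[[str], str]) -> str:
    """Transliteration Component followed by the Output Text Generator."""
    gurmukhi_tokens = [convert(token) for token in parse_tokens(text)]
    return " ".join(gurmukhi_tokens)


toy_converter = make_token_converter({"\u0633": "\u0a38", "\u0631": "\u0a30"})  # س -> ਸ, ر -> ਰ
```

Keeping the converter separate from the parser and generator mirrors the split in Figure 1, where only the Token Converter and Rules Manager contain script-specific logic.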
5.2 PMT Process 
The PMT process is implemented in the PMT Token Converter and the PMT
Rules Manager. For PMT, each Shahmukhi Token is parsed into its
constituent characters, and character dependencies are determined on
the basis of the occurrence and contextual placement of each
character in the token. In each Shahmukhi Token, some characters bear
such contextual dependencies and others are independent of them. If
the character under consideration bears a dependency, it is resolved
and transliterated with the help of the dependency rules.
If the character under consideration does not bear a dependency, it
is transliterated by character mapping, i.e. by mapping the Shahmukhi
character to its equivalent Gurmukhi character with the help of
character mapping Tables 1, 2, 3, 6 and 7, whichever is applicable.
In this way, a Shahmukhi Token is transliterated into its equivalent
Gurmukhi Token.
Consider some input Shahmukhi text S. First it is parsed into
Shahmukhi Tokens (S_1, S_2, …, S_N). Suppose that S_i = “والیاں”
[vɑlejɑ̃] is the i-th Shahmukhi Token. S_i is parsed into the
characters Vav (و) [v], Alef (ا) [ɑ], Lam (ل) [l], Choti Yeh (ی) [j],
Alef (ا) [ɑ] and Noon Ghunna (ں) [ŋ]. Then the PMT mappings and
dependency rules are applied to transliterate the Shahmukhi Token
into a Gurmukhi Token. The Gurmukhi Token G_i = “ਵਾਲਿਆਂ” is generated
from S_i. The step-by-step process is shown in Table 8.
Sr.  Character(s) Parsed   Gurmukhi Token   Mapping or Rule Applied
1    و → ਵ [v]             ਵ                Mapping Table 2
2    ا → ◌ਾ [ɑ]            ਵਾ               Rule for ALEF
3    ل → ਲ [l]             ਵਾਲ              Mapping Table 2
4    یا → ◌ਿਆ [ɪɑ]         ਵਾਲਿਆ            Rule for YEH
5    ں → ◌ਂ [ŋ]            ਵਾਲਿਆਂ           Rule for NOONGHUNNA
Note: → is read as ‘is transliterated into’.
Table 8: Methodology of PMTS
In this way, all Shahmukhi Tokens are transliterated into Gurmukhi
Tokens (G_1, G_2, …, G_N). From these Gurmukhi Tokens, the Gurmukhi
text G is generated.
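The five steps of Table 8 can be replayed with a deliberately small converter. The paper does not spell out the rules for Alef, Yeh and Noon Ghunna, so the branches below are assumptions that cover only what this one token needs; `convert_token` and `CONSONANTS` are hypothetical names.

```python
CONSONANTS = {"\u0648": "\u0a35", "\u0644": "\u0a32"}  # و -> ਵ, ل -> ਲ
ALEF, CHOTI_YEH, NOON_GHUNNA = "\u0627", "\u06cc", "\u06ba"


def convert_token(token):
    """Transliterate a token using only the mappings and rules exercised in Table 8."""
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        following = token[i + 1] if i + 1 < len(token) else ""
        if ch == CHOTI_YEH and following == ALEF:
            out.append("\u0a3f\u0a06")  # assumed rule for YEH: ی + ا -> ਿ + ਆ
            i += 2
        elif ch == ALEF:
            out.append("\u0a3e")        # assumed rule for ALEF after a consonant: ਾ
            i += 1
        elif ch == NOON_GHUNNA:
            out.append("\u0a02")        # assumed rule for NOONGHUNNA: ਂ
            i += 1
        else:
            out.append(CONSONANTS[ch])  # plain character mapping (Table 2)
            i += 1
    return "".join(out)


gurmukhi = convert_token("\u0648\u0627\u0644\u06cc\u0627\u06ba")  # والیاں
```

The result is the code point sequence of ਵਾਲਿਆਂ. Any character outside this tiny rule set raises a `KeyError`, which makes the limits of the sketch explicit.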
An important point to note is that the input Shahmukhi text must
contain all the diacritical marks needed for the correct
pronunciation and meaning of the transliterated word.
6 Evaluation Experiments 
6.1 Input Selection 
The first task in evaluating the PMT system is the selection of input
texts. To cover the historical aspect, two manuscripts were selected:
poetry by Maqbal (Maqbal) and Heer by Waris Shah (Waris, 1766).
Geographically, Punjab is divided into four parts: eastern Punjab
(Indian Punjab), central Punjab, southern Punjab and northern Punjab.
These regions represent the major dialects of Punjabi. Hymns of Baba
Nanak (eastern Punjab), Heer by Waris Shah (central Punjab), Hymns by
Khawaja Farid (southern Punjab) and Saif-ul-Malooq by Mian Muhammad
Bakhsh (northern Punjab) were selected for the evaluation of the PMT
system. All of the above texts are classical Punjabi literature. For
modern literature, poetry and short stories by different poets and
writers were selected from some issues of Puncham (a monthly Punjabi
magazine published since 1985) and from other published books. All of
the selected texts were then compiled into Unicode-encoded text, as
none of them was previously available in this form.
After this compilation, the main task is to add all necessary
diacritical marks to the text. This is done with the help of
dictionaries. The accuracy of the PMT system depends on these
diacritical marks; their absence greatly reduces accuracy.
6.2 Results 
After compilation, the selected input texts are transliterated into
Gurmukhi using the PMT system. The transliterated Gurmukhi texts are
then tested for errors and accuracy. Testing is done manually, with
the help of Shahmukhi and Gurmukhi dictionaries, by persons who know
both scripts. The results are given in Table 9.
Source                  Total Words   Accuracy (%)
Manuscripts             1,007         98.21
Baba Nanak              3,918         98.47
Khawaja Farid           2,289         98.25
Waris Shah              14,225        98.95
Mian Muhammad Bakhsh    7,245         98.52
Modern literature       16,736        99.39
Total                   45,420        98.95
Table 9: Results of PMT System
The results show that the PMT system gives more than 98% accuracy on
classical literature and more than 99% accuracy on modern literature.
Thus the PMT system fulfills the requirement of transliteration
across the two scripts of Punjabi. The only constraint on achieving
this accuracy is that the input text must contain all the diacritical
marks necessary for removing ambiguities.
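The total row of Table 9 is consistent with a word-weighted average of the per-source accuracies, which can be checked with a few lines of Python. The figures are copied from Table 9; treating the overall accuracy as a word-weighted mean is an assumption about how the total was computed.

```python
# (source, total words, accuracy in percent) as listed in Table 9.
RESULTS = [
    ("Manuscripts", 1007, 98.21),
    ("Baba Nanak", 3918, 98.47),
    ("Khawaja Farid", 2289, 98.25),
    ("Waris Shah", 14225, 98.95),
    ("Mian Muhammad Bakhsh", 7245, 98.52),
    ("Modern literature", 16736, 99.39),
]

total_words = sum(words for _, words, _ in RESULTS)
# Weight each source's accuracy by its share of the words.
overall_accuracy = sum(words * acc for _, words, acc in RESULTS) / total_words
```

Under this assumption, `total_words` is 45,420 and `overall_accuracy` rounds to 98.95, matching the total row.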
7 Conclusion 
Shahmukhi and Gurmukhi, the two prevailing scripts for Punjabi,
together serve a population of almost 110 million around the globe.
PMT is an endeavor to bridge the ethnic, cultural and geographical
divisions between the Punjabi-speaking communities. By implementing
this system of transliteration, thoughts, ideas and beliefs can be
shared across the two scripts, which supports efforts to harmonize
relationships between nations. The large repository of historical,
literary and religious work produced over generations will now be
readily available for transliteration and critique by all. A future
milestone of this research is to enable the PMT system to perform
reverse machine transliteration from Gurmukhi to Shahmukhi.
References
Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari
Visala, and Kalervo Järvelin. 2003. Fuzzy Translation
of Cross-Lingual Spelling Variants. In Proceedings of
the 26th annual international ACM SIGIR conference on
Research and development in information retrieval.
pp: 345 – 352
Baba Guru Nanak, arranged by Muhammad Asif 
Khan. 1998. " HH 66  666  63r W
i
 (Sayings of Baba Nanak in 
Punjabi Shahmukhi). Pakistan Punjabi Adbi Board, 
Lahore 
Bhatia, Tej K. 2003. The Gurmukhi Script and Other 
Writing Systems of Punjab: History, Structure and 
Identity. International Symposium on Indic Script: 
Past and future organized by Research Institute for 
the Languages and Cultures of Asia and Africa and 
Tokyo University of Foreign Studies, December 17 
– 19. pp: 181 – 213 
In-Ho Kang and GilChang Kim. 2000. English-to-
Korean transliteration using multiple unbounded 
overlapping phoneme chunks. In Proceedings of 
the 17th conference on Computational Linguistics. 
1: 418 – 424 
Khawaja Farid (arranged by Muhammad Asif Khan). 
" ääGuu  EbZâa  63r W
i
 (Sayings of Khawaja Farid in Punjabi 
Shahmukhi). Pakistan Punjabi Adbi Board, Lahore 
Knight, K. and Stalls, B. G. 1998. Translating Names
and Technical Terms in Arabic Text. Proceedings of
the COLING/ACL Workshop on Computational
Approaches to Semitic Languages
Knight, Kevin and Graehl, Jonathan. 1997. Machine 
Transliteration. In Proceedings of the 35th Annual 
Meeting of the Association for Computational Lin-
guistics. pp. 128-135 
Knight, Kevin and Graehl, Jonathan. 1998. Machine
Transliteration. In Computational Linguistics.
24(4): 599 – 612
Malik, M. G. Abbas. 2005. Towards Unicode Com-
patible Punjabi Character Set. In proceedings of 
27th Internationalization and Unicode Conference, 
6 – 8 April, Berlin, Germany 
Maqbal. Gb äæ _âú . Punjabi Manuscript in Oriental Sec-
tion, Main Library University of the Punjab, 
Quaid-e-Azam Campus, Lahore Pakistan; 7 pages; 
Access # 8773 
Mian Muhammad Bakhsh (Edited by Fareer Mu-
hammad Faqeer). 2000. Saif-ul-Malooq. Al-Faisal 
Pub. Urdu Bazar, Lahore 
Nasreen AbdulJaleel, Leah S. Larkey. 2003. Statisti-
cal transliteration for English-Arabic cross lan-
guage information retrieval. In Proceedings of the 
12th international conference on information and 
knowledge management. pp: 139 – 146 
Paola Virga and Sanjeev Khudanpur. 2003. Translit-
eration of proper names in cross-language appli-
cations. In Proceedings of the 26th annual interna-
tional ACM SIGIR conference on Research and 
development in information retrieval. pp: 365 – 
366 
Rahman Tariq. 2004. Language Policy and Localiza-
tion in Pakistan: Proposal for a Paradigmatic 
Shift. Crossing the Digital Divide, SCALLA Con-
ference on Computational Linguistics, 5 – 7 Janu-
ary 2004 
Sung Young Jung, SungLim Hong and Eunok Peak. 
2000. An English to Korean transliteration model 
of extended markov window. In Proceedings of the 
17th conference on Computational Linguistics. 
1:383 – 389 
Tanveer Bukhari. 2000. zegEZ ÌÌ6=null  ›~P Ö. Urdu Science 
Board, 299 Uper Mall, Lahore 
Waris Shah. 1766. 6J Zg @null ¦6
=
. Punjabi Manuscript in Ori-
ental Section, Main Library University of the Pun-
jab, Quaid-e-Azam Campus, Lahore Pakistan; 48 
pages; Access # [Ui VI 135/]1443 
Waris Shah (arranged by Naseem Ijaz). 1977. 6J Zg @null ¦6
=
. 
Lehran, Punjabi Journal, Lahore 
Yan Qu, Gregory Grefenstette, David A. Evans. 2003. 
Automatic transliteration for Japanese-to-English 
text retrieval. In Proceedings of the 26th annual in-
ternational ACM SIGIR conference on Research 
and development in information retrieval. pp: 353 
– 360 
