<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1614"> <Title>Urdu Localization Project: Lexicon, MT and TTS (ULP)</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 ULP Architecture </SectionTitle> <Paragraph position="0"> As indicated, ULP comprises three largely independent systems: Lexicon, MT and TTS, though these components may also be integrated to provide a written and oral interface to information. The project has three architectural layers. At the base are the core data and engines for Lexicon, MT and TTS. The middle layer provides public programming interfaces (APIs) to these engines so that they may be integrated with end-user applications at the top layer or used by third-party applications. Both the engine and API layer components are being developed in standard C/C++ so that they compile on all platforms (e.g. Microsoft, Linux, Unix). The end-user (top) layer is necessarily technology-specific and is currently being developed for the Microsoft platform. The lexicon will be given a web interface for user access. In addition, plug-ins for internet and email clients will be developed for MT and TTS to enable end-users to translate and re-display English websites in Urdu and to convert the translated Urdu text to speech. This is shown in Figure 1 below. In the figure, the layers are demarcated horizontally and the systems vertically.
The figure also shows that MT and TTS may use the Lexicon through the APIs to obtain the data they need.</Paragraph> <Paragraph position="1"> through a grant by E-Government Directorate of Ministry of IT&amp;Telecom., Government of Pakistan.</Paragraph> <Paragraph position="2"> These three systems are discussed briefly below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Urdu Lexicon </SectionTitle> <Paragraph position="0"> The Urdu Computational Lexicon being designed will hold more than 25 dimensions of information for each headword. The first task to date has been to determine this hierarchical storage structure. The structure required for end-users has been finalized.</Paragraph> <Paragraph position="1"> However, requirements for computational applications, e.g. MT, are still being finalized.</Paragraph> <Paragraph position="2"> This was perhaps one of the most challenging tasks, as no standards currently exist, although some guidelines are available. In addition, Urdu has some additional requirements (e.g. multiple plural forms, depending on whether the word is derived from Arabic or Sanskrit). Entries for more than thirty thousand headwords, complete entries for about a thousand headwords, and specification of at least 15 entries have already been completed. Currently more content is being generated. In addition, work is in progress to define the physical structure of the lexicon (e.g. storage and retrieval models).</Paragraph> <Paragraph position="3"> A prototype of this application is also available on the Microsoft platform.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 English-Urdu Machine Translation </SectionTitle> <Paragraph position="0"> Work is in progress to develop an English-to-Urdu MT engine.
The translation is based on the LFG formalism, and the project is developing grammars, lexica and the parsing/mapping/generation engine for LFG.</Paragraph> <Paragraph position="1"> Mapping and Generation prototypes have already been developed and are integrated with a freely available LFG parser for internal testing. In addition, sample grammars for English, Urdu and English-Urdu mapping have been written.</Paragraph> <Paragraph position="2"> The prototype covers about 10 percent of grammatical rules and already translates within the limited vocabulary of the engine. The work is being extended to write the parser, rewrite the mapper and generator, and develop English, Urdu and English-Urdu grammars and lexica.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Urdu Text to Speech System </SectionTitle> <Paragraph position="0"> The Urdu TTS is divided into two main parts, the Urdu Natural Language Processor and the Urdu Speech Synthesizer. The work on NLP is completed (except the intonational module, on which preliminary work has been completed). The NLP processor takes Urdu Unicode text as input and outputs a narrow phonetic transcription with syllable and stress markers. The NLP processor is integrated with the Festival speech synthesis system (though it bypasses Festival's own NLP module). A vocabulary of about 500 words has already been defined and the diphones have been created. A prototype application which synthesizes these single words has already been developed. Work is currently in progress to define Urdu intonational and durational models. In addition, work is in progress to extend the vocabulary and functionality to synthesize complete sentences. The functional prototype works on both Linux and Microsoft platforms.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Conclusion </SectionTitle> <Paragraph position="0"> Most of the work being done in the project is novel.
The Urdu language is not yet well defined for use with computers. Script, speech and language aspects of Urdu are being studied, documented and implemented in this project. The project is also testing work which has matured on Western languages but has only recently been applied to other languages, e.g. the lexical recommendations by ISLE, the LFG framework, the use of LFG for MT, speech modeling of Urdu (both spectral and temporal) and more. Non-functional issues, including performance, are also being considered. Pre-compiled lexica, user-centric pre-stored performance-enhancing profiles, frequency lists, etc. are part of the architectural tasks being addressed. Though only initial work has been done, this work is in itself substantial and has raised many questions which will be answered as the project progresses.</Paragraph> </Section> </Paper>