File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1201_metho.xml
Size: 5,405 bytes
Last Modified: 2025-10-06 14:08:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1201"> <Title>A Study in Urdu Corpus Construction</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Unicode </SectionTitle> <Paragraph position="0"> Another natural choice for storing data is to use the Unicode character set. The Unicode character set is another needed standard that we take advantage of in order to make our corpus data readily available to other researchers.</Paragraph> <Paragraph position="1"> The only reason for choosing to initially store text from BBC Urdu, and not other news agencies, is that the BBC publishes in the Unicode character set. Other news sites that publish in Urdu have gotten in the habit of publishing in graphics, presumably to avoid the hassles of arranging compatible fonts and character sets in the publishing software, systems, and client browsers. We think too it could be that Urdu publishers prefer Nastaliq-style font. There are probably a host of wonderful Nastaliq-style fonts available that work on legacy character sets, and, perhaps, publishers prefer to keep using these fonts.</Paragraph> <Paragraph position="2"> The choice to publish in graphics though makes it difficult for data harvesters to snag data from the Web. If one really wants the data that are published in graphic form, one has to rekey the text, scan it using optical character recognition technology, or contact the publisher for electronic copies of text, in which case one needs to be able to handle or convert from the character set in which the text was originally typed. In a previous project, we developed an application that can convert between 120 legacy character sets and can be customized to convert any other font or character set, so we should have minimal obstacles when it comes time to harvest non-Unicode data.</Paragraph> <Paragraph position="3"> Storing Urdu data in the Unicode character set eliminates some problems--however, we have found other problems related to different approaches to mapping Unicode-based fonts to the Arabic subset of Unicode.</Paragraph> <Paragraph position="4"> Unicode-based fonts seem to have been optimized for Arabic display, not for Urdu, so we have experienced difficulty displaying various forms of heh, noon ghunna, and hamza. We found the best Unicode-based font for properly displaying Urdu is Urdu Naskh Asiatype, available from the BBC Urdu Web site, at least among free fonts.</Paragraph> <Paragraph position="5"> We compared this font (presumably optimized for Urdu) and Arial Unicode MS (presumably not optimized for Urdu) and found that the letter heh and its variations are mapped differently in these two Unicode-based fonts variations of the letter heh For this reason, the metadata of the Urdu text in the corpus will contain the name of the Unicode-based font in which the text is stored. Any text processor that uses the data will have to normalize the usages of heh and its variations. 
To view the Urdu text properly in its surface form, the font in which the data was harvested will have to be applied.</Paragraph>
<Paragraph position="6"> Differences in font mappings are not much of a problem when handling English and other Roman-based orthographies, especially when using the Unicode character set, so special attention has to be paid to the different ways fonts display the surface forms of Urdu letters.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Urdu input method </SectionTitle>
<Paragraph position="0"> To add Urdu documents that we have only in graphic form or hard copy, we spent significant time setting up our computer for Urdu Unicode input so that we could type text directly into the corpus.</Paragraph>
<Paragraph position="1"> Using the Arabic support in our operating system, Microsoft Windows 2000 (version 5.0, Service Pack 2), we were easily able to install right-to-left script support. Since Windows 2000 uses the Unicode character set internally, we did not have to do anything special to get Unicode support.</Paragraph>
<Paragraph position="2"> Devising a plan for inputting Urdu on the keyboard was the biggest challenge. We ended up using Tavultesoft Keyman software to map our own keyboard--it was very easy to use. We found that existing keyboard mappings for Arabic script-based languages are generally not phonetic; we wanted the Urdu letter feh to be mapped to the letter f on the keyboard, and so forth. We did find one phonetically mapped keyboard for Persian that we liked, the CRL Phonetic Layout [3], so we used that mapping as a basis for developing our own (a sketch of the idea appears below). It is not important that our keyboard mapping be standardized--it need only work for the one person typing our text.</Paragraph>
<Paragraph position="3"> Conclusion. In this paper, we presented the methodology we used to build an Urdu corpus. Corpus construction for South Asian languages, specifically Urdu, involves extra work because these languages are not written in a Roman-based script. The use of the Unicode character set and of software that supports it makes building the needed corpora in these languages possible and relatively easy. Once corpora in these languages become readily available, natural language processing work in these languages can move forward.</Paragraph> </Section>
</Paper>
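The following sketch is not the actual Keyman source or the CRL layout; the particular key assignments are assumptions chosen to illustrate the idea of a phonetic mapping, in which each Latin key produces the Urdu letter that sounds most like it.

# Illustrative phonetic keyboard mapping (assumed for this sketch):
# each Latin key emits the Urdu letter it roughly sounds like.
PHONETIC_LAYOUT = {
    "a": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "d": "\u062F",  # dal
    "r": "\u0631",  # reh
    "s": "\u0633",  # seen
    "f": "\u0641",  # feh (the example from the text: f maps to feh)
    "q": "\u0642",  # qaf
    "k": "\u06A9",  # keheh
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
    "n": "\u0646",  # noon
    "w": "\u0648",  # waw
    "h": "\u06C1",  # heh goal
    "y": "\u06CC",  # farsi yeh
}

def type_phonetically(keystrokes: str) -> str:
    """Turn a sequence of Latin keystrokes into Urdu code points."""
    return "".join(PHONETIC_LAYOUT.get(key, key) for key in keystrokes)

# Example: typing "fn" yields feh followed by noon.
print(type_phonetically("fn"))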