<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0907">
  <Title>A Language Identification Application Built on the Java Client/Server Platform</Title>
  <Section position="4" start_page="0" end_page="43" type="metho">
    <SectionTitle>
2 Automatic Language Identification
</SectionTitle>
    <Paragraph position="0"> Although the general framework will support a variety of algorithms for automatic language identification, our implementation is based on Dunning's (1994) character ngram approach, which is conceptually quite simple and achieves good performance even given relatively small training sets (50K of training text is more than enough, and one can make do with as little as 1-2K), and even when the strings to be classified are quite short (e.g., 50 characters). Essentially, the method involves constructing a probabilistic model (or &amp;quot;profile&amp;quot;) based on character ngrams for each language in the classification set, and then performing classification of an  unknown string by selecting the model most likely to have generated that string.</Paragraph>
    <Paragraph position="1"> Although the model itself is quite simple, some subtle and not-so-subtle issues do arise in putting the algorithm into practice. First among these is the problem of matching the character set of the input text to the character set assumed by the language profiles -- for example, Shift-JIS and EUC-Japan are frequently used to encode Japanese documents in the PC and Unix worlds, respectively. Documents currently found on the WWW are often insufficiently labeled to indicate the language of text or the encoding of the characters within the document. We believe that use of Unicode will become increasingly widespread, obviating this problem, although for the time being we avail ourselves of the reasonable tools available 3 for identifying and converting among character sets.</Paragraph>
    <Paragraph position="2"> Second, sparse training data is a significant factor in ngram modeling, even at the level of character co-occurrences, since we consider character sequences up to 5 characters in length. We have experimented with simple add-k smoothing and, noting known problems with that method (Gale and Church, 1990), we have also experimented with Good-Turing smoothing (Good, 1953) - finding, to our surprise, that the simple &amp;quot;add 1/2&amp;quot; is only slightly less accurate.</Paragraph>
    <Paragraph position="3"> Table 2 shows the performance of the language identification algorithm when run on Dunning's (1994) English/Spanish test set, using language profiles for English and Spanish constructed from training sets of 50K characters each and varying the size of the ngrams and length of the test strings. This experiment used Good-Turing smoothing and also adopted a simplified approximation of conditional probabilities used by Dunning (personal communication) in his experiments. Each row of the table shows the language and length of the test strings, the ngram size, the number of test strings classified correctly and incorrectly, and the percentage correct.</Paragraph>
    <Paragraph position="4"> Third, in an environment where computation may more efficiently be done on the client side rather than the server side, the size of the language profiles becomes relevant, since computation cost must be traded off against communication cost for the data needed in order to perform the classification.</Paragraph>
    <Paragraph position="5"> Given that probabilities for low frequency items can be poorly estimated anyway, we have experimented with eliminating low-frequency items from the profile -- e.g., treating singletons (ngrams appearing just once in the training data) as if they never appeared at all and using the smoothed-zero value for them instead, thus trading model size against classification accuracy. Again to our surprise, we  have found that quite reasonable classification performance is sustained even when filtering out not only singletons but even ngrams that appear twice and even three times in the training set. (see Tables 3-5).</Paragraph>
    <Paragraph position="6"> Table 1 shows the dramatic size reduction that takes place as smaller window sizes are used in training the language models. In the current implementation plain text files are used for maximum portability of the resources. An application that uses a 5-gram model without filtering any of the training data would use a 220K model containing 13K observed ngrams, with an average accuracy of 98.68% for 100-500 character length strings. If the same application can function effectively with a marginally lower accuracy rate of 98.32%, then the same training data can be used to produce a profile a full order of magnitude smaller (a 23K profile containing only 1.6K ngrams), by using a trigram model and filtering out those trigrams whose observed frequency is less than 4. This 10X reduction in size for this particular resource could mean supporting 10 times as many languages with the same memory footprint or delivering the linguistic resource 10 times as fast for a client side computation.</Paragraph>
    <Paragraph position="7"> Finally, standardization of language labels must be addressed; this work follows the ISO standards for language and country codes for internationalization (\[SO, 1988b; ISO, 1988a; Yergeau et al., 1997; Alvestrand, 1995).</Paragraph>
  </Section>
  <Section position="5" start_page="43" end_page="45" type="metho">
    <SectionTitle>
3 Client/Server Architecture
</SectionTitle>
    <Paragraph position="0"> In designing a distributed application several decisions can be built into the architecture of the product or left as runtime decisions. By using a Java virtual machine as the target platform, the same code can run on a server machine or within the client graphical user interface. A sophisticated program can determine at startup whether the computation resources (memory and CPU) on the server machine or on the client workstation are better for the more complex algorithms.(In our testing we work with a SparcStation(TM) 10 file server, an Ultral(TM) with 500M of memory as a high end client and a SparcStation 2 remote client over a 28.8 Kb modem connection as a low end client.) In addition to compute resources, it is also important to consider the network bandwidth resources.</Paragraph>
    <Paragraph position="1"> The local area network configuration can make some simplifying assumptions that may not be appropriate for wide area network and remotely connected clients. We have explored the possibility of degraded application performance in exchange for reasonable response times for the remotely connected client, i.e., we have allowed a higher error rate on language labels of short text fragments in exchange for smaller language models which can easily be down-loaded over slower network connections by remote sites. In  this section, we discuss the incorporation of language identification at three possible locations: the client's Web browser, the document server, and between them at a proxy Web server.</Paragraph>
    <Section position="1" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
3.1 Client Web Browser
</SectionTitle>
      <Paragraph position="0"> We have experimented with two extreme client configurations. Our high end client has fast CPU and a large memory pool. Our low end client has both a slow CPU and small memory footprint. The high end client easily caches large language profiles and is capable of computing the best possible language labels. When the network resources are available to the high end client, it makes the most sense to perform the language labeling within the client browser. On the low end client, the available network bandwidth was the driving architectural consideration. When high bandwidth was available, delegating computations to the server system provides the best language labels and the best throughput to end users. In disconnected or low bandwidth situations, the client must perform its own labeling. In these situations, less accurate language labels with reasonable responsiveness is preferred over slow but more correct results.</Paragraph>
      <Paragraph position="1"> Three primary techniques were used to improve the responsiveness of the client side language labelling interfaces. Basically, they all attempt to minimize the work that is performed and to overlap the work whenever possible with other end user interactions. null * Asynchronous processing for perceived responsiveness. End users perceive system responsiveness in terms of its ability to react to their requests when they are presented to the system.</Paragraph>
      <Paragraph position="2"> Within our application there are clear points during system initialization and end user parameter selection when large amounts of network bandwidth and computation resources are needed. Using the builtin threading capabilities of the Java environment, we start the resource intensive operations when they are indicated, but allow the user to continue interacting with the user interface. If the user requests an operation that requires an uninitialized resource a message is presented and the application blocks until the resource is available.</Paragraph>
      <Paragraph position="3"> * Degraded language profiles for smaller footprint.</Paragraph>
      <Paragraph position="4"> Our language identification profiles have been built with 3, 4 and 5 character ngram windows.</Paragraph>
      <Paragraph position="5"> In addition to varying the ngram window size we have experimented with removal of singleton and doubleton observations in the training data.</Paragraph>
      <Paragraph position="6"> While this amplifies the sparse data problem, it does not significantly impact the end user perceived error rates for large granularity text objects, e.g., labeling a typical webpage with 1000 characters of textual information.</Paragraph>
      <Paragraph position="7"> * Subset of language profiles for specific user needs. We have been working with a mixture of western European and Asian languages 4. For remote clients it is worth the extra effort to preselect the languages that will be most beneficial to distinguish on the client machine. For demonstration purposes we use a dozen western languages and preload a few profiles during the initialization of the sample configuration.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
3.2 Document Server
</SectionTitle>
      <Paragraph position="0"> A typical document server is designed to service a large number of end user requests. While they are usually configured with large amounts of disk storage, they are not always the best computational re-sources available on the network. For static webpages, it is easy to include a language labeling tool for off-line document management. The labeling tool would be used to convert text/htmlfile into message/http files. 5 For real time information such as news wires or other database generated replies the same off line language labeling tools could be used with Common Gateway Interface (CGI) 6 scripts to automatically add language labels to dynamically generated webpages. null</Paragraph>
    </Section>
    <Section position="3" start_page="44" end_page="45" type="sub_section">
      <SectionTitle>
3.3 Proxy Web Server
</SectionTitle>
      <Paragraph position="0"> A proxy server is an intermediary program that provides value added functionality to documents as part of the transmission process. A proxy server could be configured close to the end user or close to the data source depending on the network topology. Proxy servers may also be employed as a shared workgroup or enterprise wide facility, e.g., department level proxies can share cached webpages, or an enterprise wide proxy could add an extra level of access controls. By introducing language labeling at a proxy server it is possible to combine the benefits of webpage caching and transparent content negotiation to reuse previously computed headers. 7 4We selected 200K of sample text from the WWW for the following languages: Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Spanish, Swedish, and Turkish. We use 50K of text for training the models, 50K for use in entropy calculations and 100K heldout for testing purposes. Preliminary experiments indicate that performance comparable to what we have seen with the English/Spanish test set will also be achieved with other language pairs.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="45" end_page="45" type="metho">
    <SectionTitle>
4 Java Classes
</SectionTitle>
    <Paragraph position="0"> The primary reusable Java module written for language labeling in our system is called a frequency table class. A main() routine is provided in the class to provide a stand-alone interface for generating new language profiles from training data. The generated language profiles are self documenting text files indicating the parameters used in creating the language model and the algorithms used for smoothing and filtering the training data. Methods are provided in the frequency table class for saving and loading the profiles to disk and for scoring individual strings from a loaded profile.</Paragraph>
    <Paragraph position="1"> Specialized classes were also written to provide connections within a client environment (in Java lingo an &amp;quot;applet&amp;quot;) and within a proxy HTTP server (again in Java lingo a &amp;quot;servlet&amp;quot;). In both the servlet and applet applications of the language labeling class the Java platform provided the basic class loading infrastructure to allow a common shared module and the distributed platform for running those algorithms transparently on a client or server system.</Paragraph>
  </Section>
class="xml-element"></Paper>