<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0907">
  <Title>A Language Identification Application Built on the Java Client/Server Platform</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the 1.1 release of the Java Developers Kit(TM), a wide selection of text processing and internationalization interfaces has been added to the base Java package, making the package usable for multilingual text processing. The Java programming language, the portable Java virtual machine, and the basic web infrastructure of client web browsers and document resource protocols provide a widely deployed platform suitable for distributing NLP applications. Our research is targeted at shallow machine translation and summarization of multilingual web pages. To properly bootstrap this technology, we require appropriate language labels on documents. Language labels may be present at the level of a whole document or collection of documents for large-granularity applications, or at the level of a structural component (SGML entity) for fine-grained uses (Yergeau et al., 1997). Using an n-gram language model (Dunning, 1994), we have explored a variety of mechanisms for adding language labels to legacy documents as part of the normal end-user experience of the World Wide Web.</Paragraph>
    <Paragraph position="1"> Three obvious places where language labels could be added to legacy documents are within an end-user web browser, within a document repository server, and within an intermediary proxy server. We have experimented with several client/server configurations, and we present the tradeoffs observed between labelling accuracy and the size and completeness of the language models.</Paragraph>
  </Section>
</Paper>