<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1064">
  <Title>A Phonotactic Language Model for Spoken Language Identification</Title>
  <Section position="2" start_page="0" end_page="515" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Spoken language and written language are similar in many ways; consequently, much of the research in spoken language identification (LID) has been inspired by text-categorization methodology. Both text and voice are generated from a language-dependent vocabulary, and both can be seen as stochastic time sequences corrupted by channel noise. The n-gram language model has enjoyed success in both tasks, e.g. n-character slices for text categorization by language (Cavnar and Trenkle, 1994) and Phone Recognition followed by n-gram Language Modeling, or PRLM (Zissman, 1996).</Paragraph>
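The n-character-slice approach cited above can be sketched briefly. This is a minimal illustration of rank-order n-gram profiles with the out-of-place distance in the spirit of Cavnar and Trenkle (1994), not the authors' implementation; all function names, the profile size, and the toy training strings are our own assumptions.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    # Frequency-ranked character n-gram profile (1-grams up to n_max-grams).
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the top_k most frequent n-grams, in rank order.
    return [g for g, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank displacements; n-grams unseen in the language
    # profile receive a maximum penalty.
    pos = {g: r for r, g in enumerate(lang_profile)}
    max_pen = len(lang_profile)
    return sum(abs(r - pos.get(g, max_pen)) for r, g in enumerate(doc_profile))

def classify(text, profiles):
    # Pick the language whose profile is closest to the document's.
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

The same rank-statistics idea carries over to phone-token sequences once a tokenizer replaces characters with phonetic labels.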
    <Paragraph position="1"> Orthographic forms of language, ranging from the Latin alphabet to Cyrillic script to Chinese characters, are far more distinctive of a language than their phonetic counterparts. From the speech production point of view, the thousands of spoken languages around the world are phonetically articulated using only a few hundred distinctive sounds, or phonemes (Hieronymus, 1994). In other words, sounds are shared considerably across different spoken languages. In addition, spoken documents (a spoken utterance is regarded as a spoken document in this paper), in the form of digitized wave files, are far less structured than written documents and need to be treated with techniques that go beyond the bounds of written language. All of this makes the identification of spoken language based on phonetic units much more challenging than the identification of written language. In fact, the challenge of LID is interdisciplinary, involving digital signal processing, speech recognition and natural language processing.</Paragraph>
    <Paragraph position="2"> A LID system generally comprises three fundamental components: 1) a voice tokenizer, which segments incoming voice feature frames and associates the segments with acoustic or phonetic labels, called tokens; 2) a statistical language model, which captures language-dependent phonetic and phonotactic information from the sequences of tokens; 3) a language classifier, which identifies the language based on the discriminatory characteristics of the acoustic scores from the voice tokenizer and the phonotactic scores from the language model.</Paragraph>
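Components 2) and 3) above can be sketched together: given token sequences emitted by a tokenizer, a per-language phone bigram model supplies a phonotactic score, and the classifier takes the argmax. This is a minimal sketch under our own simplifying assumptions (bigram order, add-alpha smoothing, toy phone strings), not the system described in this paper.

```python
import math
from collections import Counter

def train_bigram_lm(phone_seqs, alpha=1.0):
    # Phone bigram LM with add-alpha smoothing over the phone inventory.
    bigrams, unigrams, inventory = Counter(), Counter(), set()
    for seq in phone_seqs:
        inventory.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    V = len(inventory)

    def logprob(seq):
        # Phonotactic score: smoothed log-likelihood of the bigram chain.
        return sum(
            math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
            for a, b in zip(seq, seq[1:])
        )
    return logprob

def identify(phone_seq, lms):
    # Language classifier: pick the language whose LM scores highest.
    return max(lms, key=lambda lang: lms[lang](phone_seq))
```

In a full PRLM system the token sequences would come from a phone recognizer rather than being given, and the acoustic score would be combined with this phonotactic score.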
    <Paragraph position="3"> In this paper, we present a novel solution covering all three components, focusing on the second and third from a computational linguistic perspective. The paper is organized as follows: In Section 2, we summarize relevant existing approaches to the LID task, highlight their shortcomings, and describe our attempts to address the issues.</Paragraph>
    <Paragraph position="4"> In Section 3, we propose the bag-of-sounds paradigm, which turns the LID task into a typical text categorization problem. In Section 4, we study the effects of different experimental settings. In Section 5, we conclude our study and discuss future work.</Paragraph>
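The bag-of-sounds idea mentioned above can be illustrated in miniature: an utterance's phone tokens are counted like the words of a text document, so any vector-space text-categorization method applies. The vectorization, cosine similarity, and nearest-centroid rule below are our own illustrative choices, not the classifier used in this paper.

```python
from collections import Counter

def bag_of_sounds(phone_seq, n=2):
    # Count phone n-grams, treating the utterance like a text document.
    return Counter(tuple(phone_seq[i:i + n]) for i in range(len(phone_seq) - n + 1))

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_centroid(utterance, centroids):
    # Assign the utterance to the language with the most similar
    # bag-of-sounds centroid.
    vec = bag_of_sounds(utterance)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))
```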
  </Section>
</Paper>