<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2170">
  <Title>Jurilinguistic Engineering in Cantonese Chinese: An N-gram-based Speech to Text Transcription System</Title>
  <Section position="3" start_page="0" end_page="1121" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> British rule in Hong Kong made English the only official language in the legal domain for over a century. After the reversion of Hong Kong sovereignty to China in 1997, legal bilingualism has brought on an urgent need to create a Computer-Aided Transcription (CAT) system for Cantonese Chinese to produce and maintain the massive legally tenable records of court proceedings conducted in the local majority language (T'sou, 1993; Sin and T'sou, 1994; Lun et al., 1995). With the support from the Hong Kong Judiciary, we have developed a transcription system for converting stenograph code to Chinese characters.</Paragraph>
    <Paragraph position="1"> CAT has been widely used for English for many years and is available for Mandarin Chinese, but no such system has existed for Cantonese. Although Cantonese is a Chinese dialect, Cantonese and Mandarin differ considerably in terms of phonological structure, phonotactics, word morphology, vocabulary and orthography. Mutual intelligibility between the two dialects is generally very low. For example, while Cantonese has more than 700 distinct syllables, Mandarin has only about 400. Cantonese has 6 tone contours and Mandarin only 4. As for vocabulary, 16.5% of the words in a 1-million-character corpus of court proceedings in Cantonese cannot be found in a corpus of 30 million characters of newspaper text in Modern Written Chinese (T'sou et al., 1997). For orthography, Mainland China uses the Simplified Chinese character set, and Hong Kong uses the Traditional set plus 4,702 special local Cantonese Chinese characters (Hong Kong Government, 1999). Such differences between Cantonese and Mandarin necessitate the Jurilinguistic Engineering undertaking to develop an independent Cantonese CAT system for the local language environment.</Paragraph>
    <Paragraph position="2"> The major challenge in developing a Cantonese CAT system lies in the conversion of phonologically-based stenograph code into Chinese text. Chinese is a logographic language. Each character or logograph represents a syllable. While the total inventory of Cantonese syllable types is about 720, there are at least 14,000 Chinese character types. The limited syllabary creates many homophones in the language (T'sou, 1976). In a one-million-character corpus of court proceedings, 565 distinct syllable types were found, representing 2,922 distinct character types. Of the 565 syllable types, 470 have 2 or more homophonous characters. In the extreme case, zi represents 35 homophonous character types.</Paragraph>
    <Paragraph position="3"> These 470 syllables represent 2,810 homophonous character types, which account for 94.7% of the text, as shown in Figure 1. The homocode problem must be properly resolved to ensure successful conversion.</Paragraph>
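The homophone statistics quoted above can be reproduced on any syllable-aligned corpus. The following is a minimal sketch, assuming the corpus is available as a list of (syllable, character) pairs; the function name, the result keys and the four-token sample are hypothetical illustrations, not the authors' tooling or data.

```python
from collections import defaultdict


def homophone_stats(pairs):
    """Summarise homophony in a list of (syllable, character) tokens."""
    chars_by_syllable = defaultdict(set)
    for syllable, char in pairs:
        chars_by_syllable[syllable].add(char)

    # A syllable is ambiguous when it maps to two or more character types.
    ambiguous = {s for s, chars in chars_by_syllable.items() if len(chars) >= 2}

    # Share of running text whose syllable is ambiguous.
    covered = sum(1 for syllable, _ in pairs if syllable in ambiguous)

    return {
        "syllable_types": len(chars_by_syllable),
        "ambiguous_syllables": len(ambiguous),
        "homophonous_characters": sum(len(chars_by_syllable[s]) for s in ambiguous),
        "text_coverage": covered / len(pairs),
    }


# Hypothetical four-token sample; "zi" is shared by two characters.
SAMPLE = [("zi", "子"), ("zi", "字"), ("zi", "子"), ("faat", "法")]
print(homophone_stats(SAMPLE))
```

On the sample, one of the two syllable types is ambiguous and it covers three of the four tokens, which is the same kind of measurement as the 470 syllables and 94.7% coverage reported for the court corpus.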
    <Paragraph position="4"> 2. Computer-Aided Transcription (CAT)</Paragraph>
    <Paragraph position="6"> Cantonese CAT system. Following typical courtroom CAT systems, our process is divided into three major stages. In Stage 1, while a litigant is speaking, a stenographer inputs the speech as a sequence of transcribed syllables, i.e. stenograph codes, via a stenograph code generator.</Paragraph>
    <Paragraph position="7"> Each stenograph code basically stands for a syllable. In Stage 2, the transcription software converts the sequence of stenograph codes {s_1, ..., s_n} into the original character text {c_1, ..., c_n}.</Paragraph>
    <Paragraph position="8"> This procedure requires the conversion component to be tightly bound to the phonology and orthography of a specific language. To specifically address homonymy in Cantonese, the conversion procedure in our system is supported by bigram and trigram statistical data derived from domain-specific training. In Stage 3, manual editing of the transcribed texts corrects errors from typing mistakes or mis-transcription.</Paragraph>
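To make Stage 2 concrete, the sketch below shows one way a code sequence {s_1, ..., s_n} could be mapped to characters {c_1, ..., c_n} using bigram statistics. It is an illustration under stated assumptions, not the system described here: it uses bigrams only (no trigrams), add-one smoothing and a Viterbi search, and the romanized codes, the start marker and the four-phrase training set are hypothetical placeholders rather than real stenograph codes or court data.

```python
import math
from collections import defaultdict

START = "^"  # hypothetical sentence-start marker

# Toy training data: each phrase is a list of (code, character) pairs.
TOY_CORPUS = [
    [("faat", "法"), ("gun", "官")],
    [("faat", "法"), ("jyun", "院")],
    [("faat", "發"), ("jin", "現")],
    [("gun", "觀"), ("caat", "察")],
]


def train(corpus):
    """Count bigram contexts, bigrams and the characters seen for each code."""
    contexts = defaultdict(int)
    bigrams = defaultdict(int)
    candidates = defaultdict(set)
    for phrase in corpus:
        prev = START
        for code, char in phrase:
            contexts[prev] += 1
            bigrams[(prev, char)] += 1
            candidates[code].add(char)
            prev = char
    return contexts, bigrams, candidates


def decode(codes, model):
    """Return the most probable character sequence for a code sequence."""
    contexts, bigrams, candidates = model
    vocab = len({c for chars in candidates.values() for c in chars})

    def logp(prev, char):
        # Add-one smoothed bigram probability (an assumed smoothing choice).
        count = bigrams.get((prev, char), 0) + 1
        return math.log(count / (contexts.get(prev, 0) + vocab))

    # best maps each possible last character to (log score, path so far).
    best = {START: (0.0, [])}
    for code in codes:
        step = {}
        # Unknown codes pass through unchanged for manual editing in Stage 3.
        for char in sorted(candidates.get(code) or {code}):
            score, path = max(
                ((s + logp(p, char), pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[char] = (score, path + [char])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


model = train(TOY_CORPUS)
print("".join(decode(["faat", "gun"], model)))  # prints 法官
```

The same code "faat" resolves to a different character in decode(["faat", "jin"], model), because the following syllable changes which bigram is most probable. A trigram model would extend the search state from the last character to the last two characters.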
  </Section>
</Paper>