<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2170"> <Title>Jurilinguistic Engineering in Cantonese Chinese: An N-gram-based Speech to Text Transcription System</Title> <Section position="4" start_page="1121" end_page="1124" type="metho"> <SectionTitle> 3. System Architecture </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1121" end_page="1121" type="sub_section"> <SectionTitle> 3.1 Statistical Formulation </SectionTitle> <Paragraph position="0"> To resolve massive ambiguity in speech to text conversion, the N-gram model is used to determine the most probable character sequence {c_1, ..., c_k} given the input stenograph code sequence {s_1, ..., s_k}. The conditional probability (1) is to be maximized.</Paragraph> <Paragraph position="1"> (1) P(c_1, ..., c_k | s_1, ..., s_k) where {c_1, ..., c_k} stands for a sequence of k characters, and {s_1, ..., s_k} for a sequence of k input stenograph codes.</Paragraph> <Paragraph position="2"> The co-occurrence frequencies necessary for computation are acquired through training. However, a huge amount of data is needed to generate reliable statistical estimates for (1) if N > 3. Consequently, the N-gram probability is approximated by bigram or trigram estimates. First, rewrite (1) as (2) using Bayes' rule. (2) P(c_1, ..., c_k) x P(s_1, ..., s_k | c_1, ..., c_k) / P(s_1, ..., s_k) As the value of P(s_1, ..., s_k) remains unchanged for any choice of {c_1, ..., c_k}, one needs only to maximize the numerator in (2), i.e. (3).</Paragraph> <Paragraph position="3"> (3) P(c_1, ..., c_k) x P(s_1, ..., s_k | c_1, ..., c_k) (3) can then be approximated by (4) or (5) using bigram and trigram models respectively.</Paragraph> <Paragraph position="4"> (4) prod_{i=1..k} (P(c_i | c_{i-1}) x P(s_i | c_i)) (5) prod_{i=1..k} (P(c_i | c_{i-2} c_{i-1}) x P(s_i | c_i)) The transcription program computes the best sequence {c_1, ..., c_k} so as to maximize (4) or (5). The advantage of the approximations in (4) and (5) is that P(s_i | c_i), P(c_i | c_{i-1}) and P(c_i | c_{i-2} c_{i-1}) can be readily estimated using a training corpus of manageable size.</Paragraph> </Section> <Section position="2" start_page="1121" end_page="1122" type="sub_section"> <SectionTitle> 3.2 Viterbi Algorithm </SectionTitle> <Paragraph position="0"> The Viterbi algorithm (Viterbi, 1967) is implemented to efficiently compute the maximum value of (4) and (5) over different choices of character sequences. Instead of exhaustively computing the values for all possible character sequences, the algorithm only keeps track of the probability of the best character sequence terminating in each possible character candidate for a stenograph code.</Paragraph> <Paragraph position="1"> In the trigram implementation, the limited size of the training corpus makes it impossible to estimate all possible P(c_i | c_{i-2} c_{i-1}) because some {c_{i-2}, c_{i-1}, c_i} may never occur there. Following Jelinek (1990), P(c_i | c_{i-2} c_{i-1}) is approximated by the sum of weighted trigram, bigram and unigram estimates in (6).</Paragraph> <Paragraph position="3"> (6) P(c_i | c_{i-2} c_{i-1}) ~ w_3 x f(c_{i-2} c_{i-1} c_i)/f(c_{i-2} c_{i-1}) + w_2 x f(c_{i-1} c_i)/f(c_{i-1}) + w_1 x f(c_i)/SUM_j f(c_j) where (i) w_1, w_2, w_3 >= 0 are weights, (ii) w_1 + w_2 + w_3 = 1, and (iii) SUM_j f(c_j) is the sum of the frequencies of all characters. Typically the best results are obtained when w_3, the weight for the trigram estimate, is significantly greater than the other two weights, so that the trigram probability has the dominant effect in the probability expression. In our tests, we set w_1 = 0.01, w_2 = 0.09 and w_3 = 0.9.</Paragraph>
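To make the interpolation in (6) concrete, here is a minimal sketch in Python of how the weighted trigram, bigram and unigram estimates could be computed from raw corpus counts. The function names, data structures and corpus format are illustrative assumptions, not part of the original system.

```python
from collections import Counter

# Illustrative n-gram frequency tables; in the real system these would be
# harvested from the training corpus of court proceedings.
unigram = Counter()   # f(c_i)
bigram = Counter()    # f(c_{i-1} c_i)
trigram = Counter()   # f(c_{i-2} c_{i-1} c_i)

def train(sentences):
    """Accumulate unigram, bigram and trigram frequencies from
    sentences given as lists of characters."""
    for chars in sentences:
        for i, c in enumerate(chars):
            unigram[c] += 1
            if i >= 1:
                bigram[(chars[i - 1], c)] += 1
            if i >= 2:
                trigram[(chars[i - 2], chars[i - 1], c)] += 1

def p_interp(c, c_prev1, c_prev2, w1=0.01, w2=0.09, w3=0.9):
    """Interpolated estimate of P(c | c_prev2 c_prev1) as in (6),
    with the weights reported in the paper as defaults."""
    total = sum(unigram.values())  # SUM_j f(c_j)
    p_uni = unigram[c] / total if total else 0.0
    p_bi = bigram[(c_prev1, c)] / unigram[c_prev1] if unigram[c_prev1] else 0.0
    p_tri = (trigram[(c_prev2, c_prev1, c)] / bigram[(c_prev2, c_prev1)]
             if bigram[(c_prev2, c_prev1)] else 0.0)
    return w3 * p_tri + w2 * p_bi + w1 * p_uni
```

Because w_1 + w_2 + w_3 = 1 and each component is a maximum-likelihood estimate, the interpolated value stays between 0 and 1 while falling back gracefully on bigram and unigram evidence for unseen trigrams.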
<Paragraph position="4"> The Viterbi algorithm substantially reduces the computational complexity from O(m^n) to O(m^2 n) and O(m^3 n) under bigram and trigram estimation respectively, where n is the number of stenograph code tokens in a sentence, and m is the upper bound of the number of homophonous characters for a stenograph code.</Paragraph> <Paragraph position="5"> To maximize the transcription accuracy, we also refine the training corpus to ensure that the bigram and trigram statistical models reflect the courtroom language closely. This is done by enlarging the size of the training corpus and by compiling domain-specific text corpora.</Paragraph> </Section>
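The dynamic-programming step described in this section is compact enough to sketch directly. Below is an illustrative bigram version in Python: candidates maps each stenograph code to its homophonous characters, and the three probability functions are assumed to come from a trained model (none of these names appear in the original paper).

```python
def viterbi_decode(codes, candidates, log_p_init, log_p_trans, log_p_emit):
    """Return the character sequence maximizing (4) for a sentence of
    stenograph codes, using log probabilities to avoid underflow.

    codes       -- stenograph codes s_1 .. s_n
    candidates  -- dict: code -> list of homophonous characters
    log_p_init  -- c -> log P(c_1), the sentence-initial estimate
    log_p_trans -- (c_prev, c) -> log P(c | c_prev)
    log_p_emit  -- (s, c) -> log P(s | c)
    """
    # best[c] = (score, path) of the best partial sequence ending in c
    best = {c: (log_p_init(c) + log_p_emit(codes[0], c), [c])
            for c in candidates[codes[0]]}
    for s in codes[1:]:
        new_best = {}
        for c in candidates[s]:
            emit = log_p_emit(s, c)
            # Keep only the best predecessor for each candidate character;
            # this is what reduces the O(m^n) search to O(m^2 n).
            score, path = max(
                (prev_score + log_p_trans(prev_c, c) + emit, prev_path + [c])
                for prev_c, (prev_score, prev_path) in best.items())
            new_best[c] = (score, path)
        best = new_best
    return max(best.values())[1]  # path with the highest final score
```

A trigram version keeps one entry per pair of final characters instead of per single character, which is why its cost grows to O(m^3 n).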
<Section position="3" start_page="1122" end_page="1122" type="sub_section"> <SectionTitle> 3.3 Special Encoding </SectionTitle> <Paragraph position="0"> After some initial trial tests, error analysis was conducted to investigate the causes of the mistranscribed characters. It showed that a noticeable number of errors were due to a high failure rate in the retrieval of some characters during transcription. The main reason is that high frequency characters are more likely to interfere with the correct retrieval of other, relatively lower frequency homophonous characters. For example, in Cantonese, hai ('to be') and hai ('at') are homophonous in terms of segmental makeup.</Paragraph> <Paragraph position="1"> Their absolute frequencies in our training corpus are 8,695 and 1,614 respectively. Because of the large frequency discrepancy, the latter was mistranscribed as the former 44% of the time in a trial test. 32 such high frequency characters were found to contribute to about 25% of all transcription errors. To minimize the interference, special encoding, which resulted from shallow linguistic processing, is applied to the 32 characters so that each of them is assigned a unique stenograph code. This was readily accepted by the court stenographers.</Paragraph> <Paragraph position="2"> 4. Implementation and Results</Paragraph> </Section> <Section position="4" start_page="1122" end_page="1122" type="sub_section"> <SectionTitle> 4.1 Compilation of Corpora </SectionTitle> <Paragraph position="0"> In our experiments, authentic Chinese court proceedings from the Hong Kong Judiciary were used for the compilation of the training and testing corpora for the CAT prototypes. To ensure that the training data is comparable with the data to be transcribed, the training corpus should be large enough to obtain reliable estimates for P(s_i | c_i), P(c_i | c_{i-1}) and P(c_i | c_{i-2} c_{i-1}). In our trials, we quickly approached the point of diminishing returns when the size of the training corpus reached about 0.85 million characters. (See Section 4.2.2.) To further enhance training, the system also exploited stylistic and lexical variations across different legal domains, e.g. traffic, assault, and fraud offences. Since different case types show distinct domain-specific legal vocabulary and usage, simply integrating all texts in a single training corpus may obscure the characteristics of specific language domains, thus degrading the modelling.</Paragraph> <Paragraph position="1"> Hence domain-specific training corpora were also compiled to enhance performance.</Paragraph> <Paragraph position="2"> Two sets of data were created for testing and comparison: the Generic Corpus (GC) and the Domain-specific Corpus (DC). Whereas GC consists of texts representing various legal case types, DC is restricted to traffic offence cases. Each set consists of a training corpus of 0.85 million characters and a testing corpus of 0.2 million characters. The training corpus consists of Chinese characters along with the corresponding stenograph codes, and the testing corpus consists solely of the stenograph codes of the Chinese texts.</Paragraph> </Section> <Section position="5" start_page="1122" end_page="1123" type="sub_section"> <SectionTitle> 4.2 Experimental Results </SectionTitle> <Paragraph position="0"> For evaluation, several prototypes were set up to test how different factors affected transcription accuracy. They included (i) use of bigram vs. trigram models, (ii) the size of the training corpora, (iii) domain-specific training, and (iv) special encoding. To measure conversion accuracy, the output text was compared with the original Chinese text in each test on a character by character basis, and the percentage of correctly transcribed characters was computed. Five sets of experiments are reported below.</Paragraph> <Paragraph position="1"> Three prototypes were developed: the Bigram Prototype CATva2, the Trigram Prototype CATva3, and the Baseline Prototype CAT0. CATva2 and CATva3 implement the conversion engines using the bigram and trigram Viterbi algorithm respectively. CAT0 was set up to serve as an experimental control. Instead of implementing the N-gram model, conversion is accomplished by selecting the highest frequency item out of the homophonous character set for each stenograph code. GC was used throughout the three experiments. The training and testing data sets are 0.85 and 0.20 million characters respectively. The results are summarized in Table 1.</Paragraph> <Paragraph position="2"> The application of the bigram and trigram models offers about 14% and 15% improvement in accuracy over the control prototype, CAT0.</Paragraph>
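For comparison with the N-gram prototypes, the control conversion and the scoring procedure are both simple enough to sketch. The code below assumes each stenograph code corresponds to exactly one character, so output and reference align position by position; the function and parameter names are again illustrative, not taken from the original system.

```python
def baseline_convert(codes, candidates, char_freq):
    """CAT0-style control: for each stenograph code, output the most
    frequent character in its homophonous candidate set."""
    return [max(candidates[s], key=lambda c: char_freq[c]) for s in codes]

def character_accuracy(output_chars, reference_chars):
    """Percentage of correctly transcribed characters, the metric used
    throughout Section 4.2."""
    correct = sum(o == r for o, r in zip(output_chars, reference_chars))
    return 100.0 * correct / len(reference_chars)
```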
<Paragraph position="3"> In this set of tests, the size of the training corpora was varied to determine the impact of training corpus size on accuracy. The sizes tested are 0.20, 0.35, 0.50, 0.63, 0.73 and 0.85 million characters. Each corpus is a proper subset of the immediately larger corpus so as to ensure the comparability of the training texts. CATva2 was used in the tests. The results in Table 2 show that increasing the size of the training corpus enhances the accuracy incrementally. However, the point of diminishing returns is reached when the size reaches 0.85 million characters. We also tried doubling the corpus size to 1.50 million characters. It only yields a 0.8% gain over the 0.85 million character corpus.</Paragraph> </Section> <Section position="6" start_page="1123" end_page="1123" type="sub_section"> <SectionTitle> 4.2.3 Use of Domain-specific Training </SectionTitle> <Paragraph position="0"> This set of tests evaluates the effectiveness of domain-specific training. Data from the two corpora, GC and DC, are utilized in the training of the bigram and trigram prototypes. The size of each training set is 0.85 million characters. The same set of 0.2 million character testing data from DC is used in all four conversion tests. Without increasing the size of the training data, setups with domain-specific training consistently yield about 2% improvement. A more comprehensive set of corpora including Traffic, Assault, and Robbery is being compiled and will be reported in future work.</Paragraph> <Paragraph position="1"> Following shallow linguistic processing, special encoding assigns unique codes to 32 characters to reduce confusion with other characters. Another round of tests was run, identical to the CATva2 and CATva3 tests in Section 4.2.1 except for the use of special encoding. The training and testing corpora have 0.85 and 0.20 million characters respectively. Special encoding consistently offers about a 2% increase in accuracy. Special encoding, and hence shallow linguistic processing, provides the most significant improvement in accuracy.</Paragraph> </Section> <Section position="7" start_page="1123" end_page="1124" type="sub_section"> <SectionTitle> 4.2.5 Incorporation of Domain-Specificity and Special Encoding </SectionTitle> <Paragraph position="0"> As discussed above, both domain-specific training and special encoding raise the accuracy of transcription. The last set of tests deals with the integration of the two features. Special encoding is utilized in the training and testing data of DC, which have 0.85 and 0.20 million characters respectively.</Paragraph> <Paragraph position="1"> Recall that domain-specific training and special encoding each offers about a 2% improvement. Table 5 shows that combining BOTH features offers about 3% improvement over tests without them. (See the non-domain-specific tests in Section 4.2.3.) The 96.2% accuracy achieved by CATva3 represents the best performance of our system.</Paragraph> <Paragraph position="2"> The result is comparable with other relevant advanced systems for speech to text conversion. For example, Lee (1999) reported 94% accuracy in a Chinese speech to text transcription system under development with a very large training corpus.</Paragraph> </Section> </Section> </Paper>