<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1070">
  <Title>Towards an Intelligent Multilingual Keyboard System</Title>
  <Section position="6" start_page="22" end_page="22" type="evalu">
    <SectionTitle>
4. EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.1 Language Identification
</SectionTitle>
      <Paragraph position="0"> To test automatic language switching, we built an artificial corpus for the language identification experiment from 10,000 words selected at random from an English dictionary and 10,000 words selected at random from a Thai dictionary. Every character in the test corpus is converted to the character produced by the same key button in normal mode (no shift key applied), without applying the language-switching key. For example, the characters 'GA2', 'GA7' and 'a' are all converted to 'a'. For the language identification, we employ the key-button bi-grams extracted [...] In conclusion, the first 6 characters of the token are enough to yield high accuracy on English-Thai language identification.</Paragraph>
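      <Paragraph position="1"> The key-button normalization and bi-gram identification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the key map is a toy subset, the training words are placeholder key sequences, and the add-one smoothing, the start marker and the vocab_size parameter are all assumptions made for illustration. Only the first 6 key buttons of a token are scored, following the conclusion above. <![CDATA[
```python
import math
from collections import Counter

# Toy, illustrative key map (assumption): every character is replaced by the
# unshifted character of the same key button. A real table would cover both
# keyboard layouts completely.
KEY_MAP = {"A": "a", "ฟ": "a", "ฤ": "a"}


def to_key_buttons(text, key_map):
    """Map each character to its unshifted key-button character."""
    return "".join(key_map.get(ch, ch) for ch in text)


def train_bigrams(words):
    """Count key-button bi-grams, with '^' marking the start of a token."""
    counts = Counter()
    for word in words:
        padded = "^" + word
        counts.update(zip(padded, padded[1:]))
    return counts


def log_score(token, counts, vocab_size, prefix_len=6):
    """Add-one smoothed log score of the first prefix_len key buttons."""
    padded = "^" + token[:prefix_len]
    total = sum(counts.values())
    denom = total + vocab_size ** 2
    return sum(
        math.log((counts[bg] + 1) / denom)
        for bg in zip(padded, padded[1:])
    )


def identify(token, models, vocab_size=100):
    """Return the language whose bi-gram model scores the token highest."""
    return max(
        models,
        key=lambda lang: log_score(token, models[lang], vocab_size),
    )


# Placeholder training data (assumption): real models would be trained on
# dictionary words converted with to_key_buttons.
models = {
    "en": train_bigrams(["the", "this", "that"]),
    "th": train_bigrams(["xyz", "xzy"]),
}
```
]]> A token typed without the language-switching key would first be passed through to_key_buttons and then through identify; the language with the higher score decides how the key sequence is rendered.</Paragraph>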
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.2 Thai Key Prediction
</SectionTitle>
      <Paragraph position="0"> The training and test sets used for our key prediction algorithm are 25 MB and 5 MB, respectively. The table below shows the percentage of shifted and unshifted characters in the corpora.</Paragraph>
      <Paragraph position="1">  Because Thai text has no explicit word boundaries, we trained the trigram model on a 25-MB Thai corpus rather than on a word list from a dictionary, as was done for language identification. The trigram model was tested on a separate 5-MB corpus (the test set). As before, typing without the shift key was simulated for the test. The result is shown in Table 4.</Paragraph>
      <Paragraph position="2">  From the errors that trigram key prediction makes on the training corpus, about 12,000 error-correction rules are extracted and then reduced to 1,500. These rules are applied to the output of key prediction. The results are shown in the table below.</Paragraph>
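      <Paragraph position="3"> One way to picture the extraction, reduction and application of error-correction rules is the Python sketch below. The text does not specify the rule format or the reduction criterion, so everything here is an assumption: a rule is taken to be a predicted character with its left and right predicted neighbours, mapped to the correct character, and the rule set is reduced with a simple frequency threshold (min_count is a hypothetical parameter). <![CDATA[
```python
from collections import Counter


def context_at(predicted, i):
    """Return (left, current, right) from the predicted string, padded at the ends."""
    left = predicted[i - 1] if i > 0 else "^"
    right = predicted[i + 1] if i + 1 < len(predicted) else "$"
    return (left, predicted[i], right)


def extract_rules(predicted, gold):
    """Count candidate rules at every position where prediction and gold differ."""
    rules = Counter()
    for i, (p, g) in enumerate(zip(predicted, gold)):
        if p != g:
            rules[(context_at(predicted, i), g)] += 1
    return rules


def prune_rules(rules, min_count=2):
    """Keep, per context, the most frequent correction seen at least min_count times."""
    best = {}
    for (context, fix), n in rules.items():
        if n >= min_count and n > best.get(context, (None, 0))[1]:
            best[context] = (fix, n)
    return {context: fix for context, (fix, _) in best.items()}


def apply_rules(predicted, rules):
    """Rewrite each predicted character whose context matches a rule."""
    return "".join(
        rules.get(context_at(predicted, i), predicted[i])
        for i in range(len(predicted))
    )


# Toy strings (assumption): lowercase stands for an unshifted prediction,
# uppercase for the correct shifted character.
candidates = extract_rules("abcabcxb", "aBcaBcxB")
rules = prune_rules(candidates)
```
]]> In this toy run the context (a, b, c) is seen twice and survives the threshold, while the context seen once is dropped. A frequency threshold alone does not check whether a rule also damages correct predictions, so a real reduction step would likely need a stricter criterion.</Paragraph>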
    </Section>
  </Section>
</Paper>