<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1002">
  <Title>Translating Hong Kong News Training News News News Legal LangModel Legal News Prior Legal</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2. ENHANCEMENTS
</SectionTitle>
    <Paragraph position="0"> The first change required of the translation software was support for the two-byte encoding used for the Chinese text (GB-2312, &quot;GB&quot; for short). Further, the EBMT approach (like the dictionary and glossary approaches) is word-based, but Chinese is ordinarily written without breaks between words. Thus, Chinese input must be segmented into individual words.</Paragraph>
    <Paragraph position="1"> The initial baseline system used the segmenter made available by the LDC. This segmenter uses a word-frequency list to make segmentation decisions, but although the list provided by the LDC is large, it did not completely cover the vocabulary of the EBMT training corpus (described below). As a result, many sentences were segmented incorrectly or contained long sequences that were either left unsegmented or broken into single characters. Almost every Chinese character has at least one meaning of its own, which may be entirely different from the meaning of the word containing it. Mis-segmentation caused by the inadequate word list therefore makes it very hard to build a statistical dictionary and to index the EBMT corpus properly.</Paragraph>
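The failure mode described above can be illustrated with a minimal sketch of a frequency-based segmenter. The source does not specify the LDC segmenter's actual algorithm, so this is an assumption: a unigram model that picks the lowest-cost split by dynamic programming. The function name `segment`, the `max_len` parameter, the unknown-character penalty, and the toy frequencies are all hypothetical.

```python
import math

def segment(text, freq, max_len=4):
    """Split text into the most probable word sequence under a unigram model."""
    total = sum(freq.values())
    # Assumed penalty: an unlisted character costs slightly more than a count-1 word.
    unknown_cost = math.log(total) + 1.0
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                cost = -math.log(freq[word] / total)
            elif end - start == 1:
                cost = unknown_cost  # fall back to a single character
            else:
                continue             # unlisted multi-character strings are not words
            if best[end] > best[start] + cost:
                best[end] = best[start] + cost
                back[end] = start
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

# Toy word-frequency list (hypothetical counts); "法律" is deliberately missing.
freq = {"香港": 50, "新闻": 40, "翻译": 10, "香": 5, "港": 5}
print(segment("香港新闻", freq))  # ['香港', '新闻']
print(segment("香港法律", freq))  # ['香港', '法', '律']
```

In the second call the word missing from the list is broken into single characters, which is the kind of mis-segmentation that motivates augmenting the word list.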
    <Paragraph position="2"> To improve the performance of the Chinese segmenter, we augmented its word list by finding sequences of characters in the training corpus that belong together, based on their frequency and high mutual information. We also developed a form of term extraction to find English phrases which should be treated as atomic units for translation, thus increasing the average length of &quot;words&quot; in both source and target languages. Finally, we created an augmented bilingual dictionary for use in word-level alignment for EBMT by applying statistical dictionary extraction techniques to the training corpus.</Paragraph>
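The word-list augmentation step can be sketched as follows. This is a simplified illustration, not the authors' procedure: it scores only adjacent character pairs with pointwise mutual information, whereas the text refers to character sequences in general. The function name `candidate_words`, the thresholds `min_count` and `min_pmi`, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def candidate_words(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent character pairs that are frequent and have high PMI."""
    chars, pairs = Counter(), Counter()
    for s in sentences:
        chars.update(s)  # single-character counts
        pairs.update(s[i:i + 2] for i in range(len(s) - 1))  # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scored = {}
    for pair, count in pairs.items():
        if count >= min_count:
            p_xy = count / n_pairs
            p_x = chars[pair[0]] / n_chars
            p_y = chars[pair[1]] / n_chars
            pmi = math.log2(p_xy / (p_x * p_y))
            if pmi >= min_pmi:
                scored[pair] = pmi
    return scored

# Toy unsegmented corpus (hypothetical).
corpus = ["香港新闻", "香港法律", "新闻翻译", "法律新闻"]
print(sorted(candidate_words(corpus)))  # ['新闻', '法律', '香港']
```

Pairs that merely straddle a word boundary, such as "港新", occur only once in this corpus and are filtered out by the frequency threshold; the surviving pairs would be added to the segmenter's word list.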
    <Paragraph position="3"> Because the improved segmenter and the term finder may produce excessively long phrases, or phrases which are impossible to match in the other language, we repeat the procedure of segmenting, bracketing, and dictionary-building several times. On each successive iteration, the segmenter and bracketer are limited to words and phrases for which the statistical dictionary from the previous iteration contains translations. Through this iteration, we increased the size of the statistical dictionary at each step and guaranteed that all Chinese words generated by the segmenter have translations in the dictionary. This helps ensure that the EBMT engine can perform word-level alignments.</Paragraph>
  </Section>
</Paper>