<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-4010">
  <Title>Harvesting the Bitexts of the Laws of Hong Kong From the Web</Title>
  <Section position="5" start_page="75" end_page="76" type="concl">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> In the preceding sections we have presented our recent work on harvesting and aligning the bitexts of the laws of Hong Kong, including basic techniques for downloading English-Chinese bilingual legal texts from the official BLIS site, sound strategies for aligning the bitexts using the numbering system of the legal texts, and the XML annotation needed for the alignment results.</Paragraph>
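As an illustration only, and not the implementation used in this work, the idea of aligning by numbering can be sketched as follows. The sketch assumes that each subparagraph begins with a label such as (1), (a) or (iv); the pattern LABEL and the functions split_by_label and align_by_label are hypothetical names introduced for this example.

```python
import re

# Hypothetical label pattern: a leading marker such as "(1)", "(a)" or "(iv)".
LABEL = re.compile(r"^\s*(\([0-9a-zA-Z]+\))\s*(.*)$")


def split_by_label(lines):
    """Map each numbering label to the text that follows it."""
    units = {}
    for line in lines:
        match = LABEL.match(line)
        if match:
            units[match.group(1)] = match.group(2).strip()
    return units


def align_by_label(english_lines, chinese_lines):
    """Pair English and Chinese units that carry the same label, in English order."""
    english = split_by_label(english_lines)
    chinese = split_by_label(chinese_lines)
    return [
        (label, text, chinese[label])
        for label, text in english.items()
        if label in chinese
    ]
```

Because the sketch keys on a single label, repeated labels (for example, an (a) under both (1) and (2)) would overwrite one another; a fuller version would presumably key on the whole path of enclosing labels.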
    <Paragraph position="1"> The value of the outcome, i.e., the subparagraph-aligned bilingual corpus, can be assessed in terms of the following aspects.</Paragraph>
    <Paragraph position="2"> Corpus size The entire corpus contains 10.4M English words and 18.3M Chinese characters, several times the size of the well-known Penn Treebank corpus.</Paragraph>
    <Paragraph position="3"> Quality The texts are prepared by the Law Drafting Division of the Department of Justice, Hong Kong Government. Legal texts are known to be more precise and less ambiguous than most other types of text.</Paragraph>
    <Paragraph position="4"> Specificity and comprehensiveness The corpus specifically covers the domain of Hong Kong legislation. It is the most authoritative and complete text collection of the laws of Hong Kong.</Paragraph>
    <Paragraph position="5"> Alignment granularity The entire corpus is aligned precisely to the subparagraph level.</Paragraph>
    <Paragraph position="6"> Most subparagraphs in the legal texts are phrases, fragments of a clause, or clauses, as shown in Table 4.</Paragraph>
    <Paragraph position="7">  A bilingual corpus of this size and quality covering a specific domain so comprehensively is particularly useful not only in empirical MT research but also in computational studies of bilingual terminology and legislation. Our future work will focus on word alignment for inferring bilingual lexical resources and on automatic recognition of legal terminology.</Paragraph>
    <Paragraph position="8"> Also, our experience in constructing this bilingual corpus has laid a foundation for harvesting more bilingual text from the Web, e.g., from the Hong Kong government's Web sites. We find that almost all of the many Hong Kong government Web sites maintain consistently parallel pages in English and Chinese. We are not sure whether the bitexts in these pages exceed those of the BLIS site in volume, but we do know that they cover a large number of distinct domains, which is particularly useful for MT. If we can harvest and align the bitexts from such Web pages efficiently by exploiting their intrinsic characteristics of URL correspondence and text structure, the shortage of translation materials for empirical MT studies could finally be overcome, at least for the Chinese-English language pair.</Paragraph>
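To make the notion of URL correspondence concrete, a minimal sketch follows. It assumes, purely for illustration, that parallel pages differ only in one path segment naming the language ("en" or "tc"); actual sites may follow other conventions, and the names LANGUAGE_SEGMENTS, language_neutral_key and pair_urls are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical language markers; real sites may use other conventions.
LANGUAGE_SEGMENTS = {"en": "english", "tc": "chinese"}


def language_neutral_key(url):
    """Return (key, language) for a URL with a language path segment, else None."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for index, segment in enumerate(segments):
        if segment in LANGUAGE_SEGMENTS:
            # Replace the language segment with a placeholder to form the key.
            neutral = segments[:index] + ["*"] + segments[index + 1:]
            key = urlunsplit(
                (parts.scheme, parts.netloc, "/".join(neutral), parts.query, "")
            )
            return key, LANGUAGE_SEGMENTS[segment]
    return None


def pair_urls(urls):
    """Group URLs by language-neutral key and keep keys seen in both languages."""
    groups = {}
    for url in urls:
        result = language_neutral_key(url)
        if result is None:
            continue
        key, language = result
        groups.setdefault(key, {})[language] = url
    return [
        (pages["english"], pages["chinese"])
        for pages in groups.values()
        if "english" in pages and "chinese" in pages
    ]
```

Pages matched in this way would still need to be aligned internally, for instance through their text structure.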
  </Section>
</Paper>