<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-4010">
  <Title>Harvesting the Bitexts of the Laws of Hong Kong From the Web</Title>
  <Section position="3" start_page="71" end_page="73" type="metho">
    <SectionTitle>
2 Bilingual Texts of the Laws of HK
</SectionTitle>
    <Paragraph position="0"> The laws of Hong Kong (HK) before 1987 were exclusively enacted in English. They were translated into Chinese in the run-up to the handover in 1997. Since then all HK laws have been enacted in both English and Chinese, both versions being equally authentic. This gives rise to a valuable set of bitexts in large quantity and high quality that can be utilized to facilitate empirical MT research.</Paragraph>
    <Section position="1" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
2.1 BLIS Corpus
</SectionTitle>
      <Paragraph position="0"> The bilingual texts of the laws of Hong Kong have been made available to the public in recent years by the Justice Department of the HKSAR through the bilingual laws information system (BLIS). All these texts are freely accessible from http://www.justice.gov.hk/.</Paragraph>
      <Paragraph position="1"> BLIS provides the most comprehensive documentation of HK legislation. It contains all statute laws of Hong Kong currently in operation, including all ordinances and subsidiary legislation of HK (and some of their past versions dating back to 30 June 1997), the Basic Law and the Sino-British Joint Declaration, the Constitution of the PRC and national laws that apply in HK, and other rel- [...] The BLIS legal texts contain approximately 10 million English words and 18 million Chinese characters. Lexical resources of this kind are particularly useful in bilingual legal terminology studies and text alignment work.</Paragraph>
    </Section>
    <Section position="2" start_page="71" end_page="72" type="sub_section">
      <SectionTitle>
2.2 Text Hierarchy
</SectionTitle>
      <Paragraph position="0"> BLIS organizes the legal texts in terms of the hierarchy of the Loose-Leaf Edition of the Laws of Hong Kong. At the top level, the ordinances are arranged by chapters, each of which is identified by an assigned number and a short title, e.g., Chapter 5 OFFICIAL LANGUAGES ORDINANCE / 第5章 法定語文條例. The assigned number for a subsidiary legislation chapter consists of a chapter number and a following uppercase letter, e.g.,</Paragraph>
      <Paragraph position="2"> The content of an ordinance, exclusive of its long title, is divided and identified according to a very rigid numbering system which encodes the hierarchy of the texts of the laws. Both the Chinese and English versions of an ordinance follow exactly the same hierarchical structures, such as chapters (章), parts (部), sections (條), sub-sections (款), paragraphs (段) and subparagraphs (節). This allows us to align the bitexts along this hierarchical structure, once they are downloaded from the BLIS official site. To our knowledge, a well-aligned bilingual corpus of this size covering a special domain so comprehensively is seldom readily available for the Chinese-English language pair.</Paragraph>
      <Paragraph position="3"> Excerpts from the BLIS corpus are illustrated in Figures 1 and 2, one illustrating its hierarchy and the other a pair of BLIS bitexts. From the excerpts we can see that not everything has an exact match between a pair of BLIS Web pages. For example, the Chinese side has a gazette number &quot;25 of 1998 s. 2&quot; and a piece of &quot;remarks&quot; at the beginning of the content text, whereas its English counterpart has neither.</Paragraph>
      <Paragraph position="4"> 3 Harvesting Bitexts from the Web
Two phases are involved in constructing the bilingual corpus of the laws of HK. The first phase is to harvest the monolingual texts of HK laws from the BLIS site and align them into pairs. It involves the following steps: (1) downloading Web pages one by one with the aid of a Web crawler, (2) extracting the texts from them by filtering out the HTML markup, and (3) aligning the extracted monolingual texts into bilingual pairs. The second phase is to align finer-grained text structures within each text pair.</Paragraph>
    </Section>
    <Section position="3" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
3.1 Downloading BLIS Web Pages
</SectionTitle>
      <Paragraph position="0"> A BLIS Web page does not necessarily correspond to any particular text structure such as a chapter, a part, a section, a subsection, or a paragraph in the BLIS hierarchy. A chapter, especially a short one, may be organized into a few sections in a Web page or in several contiguous pages. Some sections, e.g., the long ones, are divided into several pages. In general, BLIS does not maintain any reliable match between its Web pages and any particular structure in the text hierarchy. Fortunately, in most cases a BLIS page has a counterpart in the other language. There is a &quot;switch language&quot; button on each page that links to the counterpart page. Such linkage allows us to download the Web pages in pairs and, consequently, harvest a list of page-to-page aligned bitexts. In addition to the pair link, each BLIS page also carries links to the &quot;next&quot; and the &quot;previous section of enactment&quot;. These two kinds of linkage turn the pages into two doubly linked lists, one per language, as illustrated in Figure 3, with each page as a node. Paired nodes are also doubly linked between the two lists.</Paragraph>
      <Paragraph position="1"> However, the pairwise linkage is not reliable in the BLIS site, because there are missing Web pages in one of the two languages in question (see Table 3 below for more details). In order to download all bitexts of legislation from the site, we need to go through one linked list and download each page and its counterpart, if there is one, in the other language. Such scanning gives a list of text pairs, where some pages may have a null</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="73" end_page="75" type="metho">
    <SectionTitle>
BLIS numbering
</SectionTitle>
    <Paragraph position="0"> counterpart. An alternative strategy is to download each list separately, and then match the pages into pairs sequentially with the aid of the numbering information in the header of each page (see Section 3.2 below). These two strategies verify one another, making sure that all pages are downloaded and put in the right pairs.</Paragraph>
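    <Paragraph> The first strategy, walking one linked list and recording each page with its counterpart, can be sketched as follows. This is a minimal Python illustration only (the crawler reported here is written in Java). The in-memory PAGES dictionary stands in for the BLIS site, and the page identifiers and the field names "next" and "pair" are assumptions made for the example, not actual BLIS markup.
<![CDATA[
```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for the BLIS site: each page id maps to
# its "next" link and its "switch language" (pair) link, either of which
# may be absent.
PAGES: Dict[str, Dict[str, Optional[str]]] = {
    "en-1": {"next": "en-2", "pair": "zh-1"},
    "en-2": {"next": "en-3", "pair": None},   # Chinese counterpart missing
    "en-3": {"next": None, "pair": "zh-3"},
    "zh-1": {"next": "zh-3", "pair": "en-1"},
    "zh-3": {"next": None, "pair": "en-3"},
}


def harvest_pairs(start: str) -> List[Tuple[str, Optional[str]]]:
    """Walk one linked list via "next" links and record each page together
    with its counterpart, or None when the counterpart is missing."""
    pairs = []
    seen = set()  # guards against cycles in the "next" links
    current: Optional[str] = start
    while current is not None and current not in seen:
        seen.add(current)
        page = PAGES[current]
        pairs.append((current, page["pair"]))
        current = page["next"]
    return pairs
```
]]>
    Walking the English list from "en-1" yields one entry per English page, with None marking the page whose Chinese counterpart is missing.</Paragraph>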
    <Paragraph position="1"> The downloading is carried out by a Web crawler implemented in Java. In order to accomplish the above strategies, it also has to handle a number of technical issues.</Paragraph>
    <Paragraph position="2"> * It sleeps for a while (e.g., 10 seconds) after it finishes downloading a certain number of pages (e.g., 50 pages), because the BLIS site refuses continuous access from one site for too long.</Paragraph>
    <Paragraph position="3"> * When an error occurs, it remembers the current URL, so that it can restart from where it stopped.</Paragraph>
    <Paragraph position="4"> Data on the file downloading from the BLIS site are given in Table 1. If the crawler could automatically tune the sleep and downloading intervals to maximize downloading efficiency, it would get the job done significantly more quickly. Our choice of a 10-second sleep after every 50 files is based on the error records of a number of test runs.</Paragraph>
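    <Paragraph> The throttling and restart behaviour described above can be sketched in Python as follows (again only an illustration; the actual crawler is a Java program). The function name crawl, its parameters, and the choice of OSError as the failure signal are assumptions for the example. The fetch and sleep functions are passed in so that the logic can be exercised without network access.
<![CDATA[
```python
import time
from typing import Callable, Dict, Iterable, Optional


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          batch_size: int = 50,
          pause_seconds: float = 10.0,
          sleep: Callable[[float], None] = time.sleep,
          resume_from: Optional[str] = None):
    """Download pages in order, pausing after every `batch_size` downloads.

    Returns (pages, failed_url). `failed_url` is None when every page was
    downloaded; otherwise it is the URL to pass as `resume_from` next time.
    """
    pages: Dict[str, str] = {}
    skipping = resume_from is not None
    downloaded = 0
    for url in urls:
        if skipping:
            if url != resume_from:
                continue            # already handled in an earlier run
            skipping = False
        try:
            pages[url] = fetch(url)
        except OSError:
            return pages, url       # remember where the crawl stopped
        downloaded += 1
        if downloaded % batch_size == 0:
            sleep(pause_seconds)    # back off so the site does not refuse us
    return pages, None
```
]]>
    A caller would loop, passing the returned URL back as resume_from until the second element of the result is None.</Paragraph>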
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.2 Aligning Web Pages
</SectionTitle>
      <Paragraph position="0"> Every BLIS Web page is identified by a subtitle that carries numbering information about the page, as illustrated in Figure 1. Such a subtitle is exactly retained in the page as its HTML title.</Paragraph>
      <Paragraph position="1">  This feature is utilized to align BLIS pages: all downloaded files are named in terms of the numbering information extracted from their HTML titles, as illustrated in Table 2. Consequently, all files are naturally aligned in pairs by their names. Any file names not in a pair indicate the missing counterparts in the other language. The statistics of file alignment are given in Table 3.</Paragraph>
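      <Paragraph> Pairing files by name can be sketched as below, in Python for illustration. The actual naming scheme is the one shown in Table 2; the pattern used here, numbering.lang.txt with lang being "en" or "zh", is a hypothetical stand-in, as is the function name align_by_name.
<![CDATA[
```python
from typing import Dict, List, Set, Tuple


def align_by_name(files: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split file names into paired keys and keys missing one language.

    Assumes the hypothetical naming scheme "<numbering>.<lang>.txt",
    where <lang> is "en" or "zh".
    """
    by_lang: Dict[str, Set[str]] = {"en": set(), "zh": set()}
    for name in files:
        # Split from the right so dots inside the numbering are preserved.
        key, lang, _ext = name.rsplit(".", 2)
        by_lang[lang].add(key)
    paired = sorted(by_lang["en"] & by_lang["zh"])
    only_en = sorted(by_lang["en"] - by_lang["zh"])
    only_zh = sorted(by_lang["zh"] - by_lang["en"])
    return paired, only_en, only_zh
```
]]>
      Keys that appear in only one of the two returned "only" lists correspond to pages whose counterpart in the other language is missing.</Paragraph>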
    </Section>
    <Section position="2" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.3 Text Extraction
</SectionTitle>
      <Paragraph position="0"> This task involves two aspects, namely, filtering HTML markup and extracting content text. A straightforward strategy is to first clean up the HTML tags in each page and then remove the non-legal content. The tags are in brackets, and the non-legal content follows a consistent pattern throughout all BLIS pages. However, a more convenient way is to make use of a reliable feature of the BLIS pages: the legal content is placed between two (and only two) horizontal bars in each page. Accordingly, we implement a strategy that first extracts everything between the two bars and then cleans up the remaining HTML tags. The output from this procedure includes
* a header as a fixed set of items, including chapter number, title, heading, etc., and
* a piece of content text as a list of numbered items, each on its own line. (See the header and content text in Figure 2.)
The text in a BLIS page is displayed as a sequence of hierarchically numbered items, such as subsections, paragraphs and subparagraphs.</Paragraph>
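      <Paragraph> A minimal Python sketch of the bar-based extraction follows. It assumes that the horizontal bars are rendered as hr elements, which the text above does not state, and the function name extract_legal_text is hypothetical. Regular expressions are used here for brevity; they are adequate only because the pages are said to follow a consistent pattern.
<![CDATA[
```python
import html
import re

HR = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)   # a horizontal bar
TAG = re.compile(r"<[^>]+>")                      # any remaining HTML tag


def extract_legal_text(page: str) -> str:
    """Return the text between the two horizontal bars, with tags removed.

    Assumes the bars are <hr> elements and that the page contains exactly
    two of them.
    """
    parts = HR.split(page)
    if len(parts) != 3:
        raise ValueError("expected exactly two horizontal bars")
    text = TAG.sub("", parts[1])
    # Decode character entities and drop blank lines.
    lines = [html.unescape(line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```
]]>
      Raising an error when a page does not contain exactly two bars makes any exception to the assumed layout visible instead of silently producing wrong text.</Paragraph>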
    </Section>
    <Section position="3" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
3.4 Text Alignment within Text Pairs
</SectionTitle>
      <Paragraph position="0"> After page (or file) alignment, each page finds its counterpart in the other language. After text extraction, a page gives a content text consisting of a list of numbered items, each on its own line. Such a pair of pages carries similar lines, provided that no line is missing from either page of the pair.</Paragraph>
      <Paragraph position="1"> Sample content text extracted from a BLIS page (// indicates a text line break): (1) All Ordinances shall be enacted and published in both official languages.// (2) Nothing in subsection (1) shall require an Ordinance to be enacted and published in both official languages where that Ordinance amends another Ordinance and-// (a) that other Ordinance was enacted in the English language only; and// (b) no authentic text of that Ordinance has been published in the Chinese language under section 4B(1).// (3) Nothing in subsection (1) shall require an Ordinance to be enacted and published in both official languages where the Chief Executive in Council- (Amended 26 of 1999 s.3)//</Paragraph>
      <Paragraph position="2"> Unfortunately, missing lines are found in some BLIS pages, as exemplified in Figure 2. There is no guarantee that matching text lines one by one in sequence would produce the expected alignment within a page pair. However, the numbering items at the beginning of each line can be utilized as anchors to facilitate the alignment. The strategy is as follows.
1. Anchor identification: numbering items at the beginning of each line are recognized as anchors, with the beginning and the end of the whole content text as two special anchors, resulting in a list of anchors for each page;
2. Anchor alignment: match the two lists of anchors sequentially. If a pair of anchors does not match, give up the smaller one (in terms of the BLIS numbering hierarchy) and move on to the next possible pair, following exactly the same procedure as matching identical anchor pairs between two sorted lists of anchors.</Paragraph>
      <Paragraph position="3"> 3. Text line alignment: a pair of matched anchors gives a pair of matched lines; an unmatched anchor indicates a missing line in the other language.</Paragraph>
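      <Paragraph> The anchor alignment step is essentially a merge of two sorted lists and can be sketched in Python as follows. The sketch assumes that each anchor has already been converted into a sortable key reflecting the numbering hierarchy, e.g. the tuple (2,) for "(2)" and (2, "a") for "(2)(a)"; this key representation and the function name align_anchors are illustrative assumptions, and the two special begin and end anchors are omitted for brevity.
<![CDATA[
```python
from typing import List, Optional, Tuple

# Hypothetical sortable key: (2,) for "(2)", (2, "a") for "(2)(a)".
Anchor = Tuple


def align_anchors(left: List[Anchor], right: List[Anchor]
                  ) -> List[Tuple[Optional[Anchor], Optional[Anchor]]]:
    """Match two sorted anchor lists, as in merging two sorted lists.

    A matched anchor yields a pair; an unmatched anchor is paired with
    None, which marks a missing line in the other language.
    """
    aligned = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            aligned.append((left[i], right[j]))
            i += 1
            j += 1
        elif left[i] < right[j]:
            aligned.append((left[i], None))    # give up the smaller anchor
            i += 1
        else:
            aligned.append((None, right[j]))   # give up the smaller anchor
            j += 1
    # Whatever remains on either side has no counterpart.
    aligned.extend((a, None) for a in left[i:])
    aligned.extend((None, b) for b in right[j:])
    return aligned
```
]]>
      Because each list is scanned once, the alignment takes time linear in the total number of anchors.</Paragraph>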
      <Paragraph position="4"> 4 XML Markup for the Aligned Corpus
XML is applied to encode the text alignment output of the above alignment procedure. It has become a standard for data representation and exchange on the Web, and is also accepted by the NLP community as a standard for linguistic data annotation and representation (Ide et al., 2000; Mengel and Lezius, 2000; Kim et al., 2001). A series of annual NLPXML workshops has been held since 2001. XML provides a platform-independent, flexible and sophisticated plain text format for data encoding and manipulation. It is particularly suitable for hierarchical linguistic data such as the hierarchically aligned bilingual corpus that we have produced. Moreover, converting data to XML format not only significantly reduces the complexity of data exchange among different computer systems but also enhances data transmission reliability and eases Web browsing.</Paragraph>
      <Paragraph position="5"> Many corpora have been annotated with XML, e.g., the HCRC Map Task Corpus (Anderson et al., 1991), the American National Corpus (Ide and Macleod, 2001), and the la Repubblica corpus (Baroni et al., 2004). Below we present the XML schema for our subparagraph-aligned BLIS bitexts, with sample annotation and the accompanying Web browsing facility.</Paragraph>
    </Section>
    <Section position="4" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
4.1 XML Schema
</SectionTitle>
      <Paragraph position="0"> The current version of the XML schema for the bilingual BLIS corpus, as given in Figure 4, focuses on encoding all text structures in the BLIS hierarchy, including all elements in each BLIS Web page. It is to be extended to cover finer-grained structures such as clauses, phrases and words, as we proceed to align the BLIS bitexts at these linguistic levels. For simplicity, we allow para to subsume all types of text line, be they a section, subsection, paragraph or subparagraph. The annotation of a sample bitext with this schema is illustrated in Figure 5. Annotation of this kind is carried out by a Java program automatically for the entire bitext corpus.</Paragraph>
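      <Paragraph> How aligned lines might be serialized can be sketched in Python as below. This is not the schema of Figure 4: only the element name para comes from the text above, while the names bitext, pair, id and lang, and the function name to_xml, are illustrative assumptions. The annotation reported here is produced by a Java program.
<![CDATA[
```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def to_xml(aligned: List[Tuple[str, Optional[str], Optional[str]]]) -> str:
    """Serialize aligned lines as XML.

    Each item is (anchor, english_line, chinese_line); None marks a line
    missing in that language. Only `para` is named in the paper; `bitext`,
    `pair`, `id` and `lang` are hypothetical names.
    """
    root = ET.Element("bitext")
    for anchor, en, zh in aligned:
        pair = ET.SubElement(root, "pair", {"id": anchor})
        for lang, text in (("en", en), ("zh", zh)):
            if text is not None:
                para = ET.SubElement(pair, "para", {"lang": lang})
                para.text = text   # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")
```
]]>
      Building the tree through a library, instead of concatenating strings, keeps the output well-formed even when a text line contains markup characters.</Paragraph>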
    </Section>
    <Section position="5" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
4.2 Corpus Browsing
</SectionTitle>
      <Paragraph position="0"> A number of display modes are designed for browsing the subparagraph-aligned bitexts, including bilingual modes and monolingual modes.</Paragraph>
      <Paragraph position="1"> In a bilingual mode, text line pairs are displayed in sequence. Users can switch the language order, or switch from one mode to another, at any time during browsing. The bilingual display mode is illustrated in Figure 6.</Paragraph>
    </Section>
  </Section>
</Paper>