<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1802">
  <Title>Some Considerations on Guidelines for Bilingual Alignment and Terminology Extraction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. High Quality Terminology Alignment and Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Bilingual Legal Terminology in Hong Kong
</SectionTitle>
      <Paragraph position="0"> The implementation of a bilingual legal system in Hong Kong as a result of the return of sovereignty to China in 1997 has given rise to a need for the creation and standardization of Chinese legal terminology of the Common Law on par with the English one. The standardization of legal terminology will not only facilitate the mandated wider use of Chinese among legal professionals in various legal practices such as trials and production of legal documentation involving bilingual laws and judgments, but also promote greater consistency of semantic reference of terminology to minimize ambiguity and to avoid confusion of interpretation in legal argumentation.</Paragraph>
      <Paragraph position="1"> In the early 1990s, Hong Kong law drafters and legal translation experts undertook the unprecedented task of translating Hong Kong Laws, which are based on the Common Law system, from English into Chinese. In the process, many new Chinese legal terms for the Common Law were introduced. On this basis, an English-Chinese Glossary of legal terms and a Chinese-English Glossary were published in 1995 and 1999 respectively. The legal terminology was vetted by the high-level Bilingual Laws Advisory Committee (BLAC) of Hong Kong. The glossaries, which contain about 30,000 basic entries, have become an important reference for Chinese legal terms in Hong Kong. The Bilingual Legal Information System (BLIS) developed by the Department of Justice, HKSAR provides simple keyword search for the glossaries and laws that are available in both Chinese and English. Nevertheless, the glossaries are far from adequate for many other types of legal documentation, e.g. contracts and court judgments. One major limitation of the BLIS glossary is that its coverage is restricted to legal terminology in the Laws of Hong Kong, within a basically prescriptive context that reflects the laws as they were studied at the time of their promulgation. There are other important bilingual references (Li and Poon 1998, Yiu and Au-Yeung 1992, Yiu and Cheung 1996) which focus more on the translation of Common Law concepts. The entries in these references are almost exclusively nominal expressions.</Paragraph>
      <Paragraph position="2"> In 2000, the City University of Hong Kong, in cooperation with the Judiciary, HKSAR, initiated a research project to develop a bilingual text retrieval system, the Electronic Legal Documentation/Corpus System (ELDoS), which is supported by a bilingually aligned corpus of judgments. The purpose of the ongoing project is twofold. First, the aligned legal corpus enables the retrieval of legal terms used in authentic contexts where the essence and spirit of the laws are tested (and contested) in reality, explicated and elaborated on, as an integral part of the evolving and defining body of important precedent cases unique to the Common Law tradition. Second, the corpus covers judgment texts involving interpretation of different language styles and vocabulary from Hong Kong laws. The alignment markup also serves as the basis for the compilation of a high-quality bilingual legal term bank. To complete the task within the tight timeframe, a team of annotators highly trained in law and language is involved in alignment markup and related editing.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Need for Human Input
</SectionTitle>
      <Paragraph position="0"> The legal professionals who are the target users of ELDoS have very stringent demands on terminology in terms of accuracy, coverage and consistency. Aligned texts and extracted terms must therefore be carefully and thoroughly verified manually to minimize errors.</Paragraph>
      <Paragraph position="1"> Furthermore, many studies on terminology alignment and extraction deal predominantly with nominal expressions. Since the project aims to provide comprehensive information on the manifestations of legal vocabulary in Chinese and English texts, the retrieval system should not restrict users to nominal expressions but should also provide reference to many other phenomena such as alternation of part-of-speech (POS) (e.g. noun-verb alternation) inherent in bilingual texts, as will be seen in Section 3.</Paragraph>
      <Paragraph position="2"> The availability of bilingual corpora has made it possible to construct representative term banks. Nonetheless, current alignment and term extraction technologies are still considered insufficient to meet the requirements for high quality terminology extraction. In the ELDoS project, many issues are unlikely to be handled satisfactorily by computer in the foreseeable future. Although human input is essential for high quality term bank construction, the practice of manual intervention is not straightforward.</Paragraph>
      <Paragraph position="3"> Indeed, the manual efforts to correct the errors can be substantial, and the associated cost should not be underestimated. The annotator must first go through the entire texts to spot the errors and terms left out by the machines. In this process, both the source and target materials have to be consulted. The annotator must also ensure the consistency of the output. As a result, guidelines should be set up to streamline the process.</Paragraph>
      <Paragraph position="4"> 3. Aspects of Terminology Alignment. This section describes the approach adopted for the manual annotation of alignment markup and the maintenance of the term bank in the ELDoS project.</Paragraph>
      <Paragraph position="5"> Additional caution has been taken in the coordination of a team of annotators.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Term Frequency
</SectionTitle>
      <Paragraph position="0"> An important reason for manual intervention in bilingual term alignment is the relatively poor recall for low-frequency terms. Many extraction algorithms make use of statistical techniques to identify multi-word strings that frequently co-occur (Wu and Xia 1995; Kwong and Tsou 2001). These methods are less effective for locating low-frequency terms. Of the 16,000 terms extracted from the ELDoS bilingual corpora, about 62% occur only once in about 80 judgments. For high quality alignment and extraction, failure to include these low-frequency terms would be totally unacceptable.</Paragraph>
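The effect of a frequency cut-off on recall can be illustrated with a minimal Python sketch. The toy occurrence list, the function names `hapax_share` and `frequency_filter`, and the threshold of two occurrences are all assumptions made for illustration; they are not taken from the ELDoS data or tools.

```python
from collections import Counter


def hapax_share(term_occurrences):
    """Return the fraction of distinct terms that occur exactly once."""
    counts = Counter(term_occurrences)
    if not counts:
        return 0.0
    hapaxes = [term for term, n in counts.items() if n == 1]
    return len(hapaxes) / len(counts)


def frequency_filter(term_occurrences, min_count=2):
    """Keep only terms seen at least min_count times.

    This mimics, very roughly, an extractor that relies purely on
    frequency statistics (an assumption, not a specific algorithm).
    """
    counts = Counter(term_occurrences)
    return {term for term, n in counts.items() if n >= min_count}


# Hypothetical occurrences, invented for illustration only.
occurrences = [
    "judgment", "judgment",
    "appeal", "appeal", "appeal",
    "severance payment", "arrears of wages", "oral decision",
]

print(hapax_share(occurrences))
print(sorted(frequency_filter(occurrences)))
```

In this toy list, every term that occurs only once is dropped by the cut-off, which is the kind of loss that manual review is meant to recover.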
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Correspondence of Aligned Units
</SectionTitle>
      <Paragraph position="0"> Because of differing grammatical requirements and language styles, a term in the source language often differs in various ways from its corresponding manifestations in the target language. These differences include alternation of POS and the use of paraphrastic expressions.</Paragraph>
      <Paragraph position="1"> Although many term banks avoid such variations and focus primarily on equivalent nominals or verbs, the correspondence of terms between two typologically different languages is often more complicated. For example, the English nominal "fulfilment" is more naturally translated into Chinese as a verb ("G6CG0CGE9", "GB4GEAG0CGE9", "G97GAA"). More examples can be found in Table 1.</Paragraph>
      <Paragraph position="2">  In some cases, there are simply no equivalent words in the target language. Paraphrasing or circumlocution may be necessary. Such correspondences are far less consistent and far less obvious, and are therefore difficult for a computer to identify.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Paraphrasing/Circumlocution
English Chinese
</SectionTitle>
      <Paragraph position="0"> The judge entered judgment in favour of the respondents in respect of their claim for arrears of wages, and severance payment.</Paragraph>
      <Paragraph position="1">  Because of language differences, legal terms can be contextually realized as anaphors in the target language. Examples of such correspondence would be useful for legal drafting and translation. Again, such anaphoric relations are more accurately handled by humans.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Discontinuous Units
</SectionTitle>
      <Paragraph position="0"> Most term extraction algorithms deal with contiguous units, e.g. n-grams. These algorithms have difficulty handling discontinuous units, which include phrasal verbs (e.g. "strike out") and collocation patterns (e.g. "lodge three complaints", "G6FG12...G9EG7CGABG45"). These have to be manually added or edited. Interestingly, our preliminary study shows that over 90% of the instances of discontinuous units are found in the Chinese manifestation of English terms. Some examples are listed in Table 4.</Paragraph>
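The contrast between contiguous and discontinuous units can be sketched in a few lines of Python. The example sentence, the function names and the `max_gap` parameter are hypothetical and serve only to show why a plain n-gram window misses a split phrasal verb such as "strike ... out"; this is not the matching procedure used in the ELDoS project.

```python
def contiguous_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_discontinuous(tokens, first, second, max_gap=3):
    """Return (i, j) index pairs where `first` is followed by `second`
    with at most max_gap intervening tokens (nearest match only)."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        upper = min(len(tokens), i + max_gap + 2)
        for j in range(i + 1, upper):
            if tokens[j] == second:
                hits.append((i, j))
                break
    return hits


# Hypothetical sentence, invented for illustration only.
tokens = "the court decided to strike the claim out".split()

print(("strike", "out") in contiguous_ngrams(tokens, 2))
print(find_discontinuous(tokens, "strike", "out"))
```

A gap-tolerant search of this kind still needs a list of candidate pairs to look for, which is one reason such units tend to be added or edited by hand.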
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Selective Markup
</SectionTitle>
      <Paragraph position="0"> To avoid producing "uninteresting" term alignment, restricting markup to terms of the domain of interest would be an attractive alternative to full-text alignment. In the ELDoS project, it is possible to mark up only legal terminology. Other non-legal elements can be omitted in alignment annotation. This approach has been accepted by the ELDoS client. Some examples of legal and non-legal terms are shown in Table 5.</Paragraph>
      <Paragraph position="1">  However, many other terms are more ambiguous. There is often no hard and fast rule for deciding domain membership. Annotators have to rely on their own judgement to decide whether an expression should be counted as a legal term. For example, the English words listed in Table 6 are not used exclusively in the legal domain. However, taking into account their frequency, legal context and multiple renditions in Chinese, they are worth considering as "semi-legal." What is interesting about "I" is that, although it is a common pronoun, the corresponding Chinese manifestation "G21G82" is used exclusively in the judgments and should be regarded as legal. These examples suggest that the decision to classify a phrase as a legal term is far from straightforward.  Selective markup, however, could give rise to intra- and inter-annotator inconsistency. The vagueness of legal terms could lead to variation in the selection of the same term at different times and among different annotators. In the ELDoS project, computer-aided markup tools that can instantly check candidate expressions against the term bank are an effective reference for annotators to maintain consistency. Terms that are found in the term bank should be included in the alignment. In this way, the term bank can serve as a working standard for annotators. As for new terms, our annotators have adopted the principle that whenever they have doubts as to the domain membership of a new term, they should include the term in the alignment. In this way, all the candidate terms are guaranteed to be available to the term bank manager for a final decision.</Paragraph>
      <Paragraph position="2"> Inter-annotator differences can also be reduced by fostering more communication among annotators such as regular review of peer work.</Paragraph>
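The working rule described above (include known terms, and include new terms whenever their domain membership is in doubt) can be sketched in Python as follows. The function name, the three judgement labels, the return labels and the sample term bank are assumptions for illustration only; the source does not describe the actual interface of the ELDoS markup tools.

```python
def markup_decision(candidate, term_bank, judgement):
    """Decide whether a candidate expression enters the alignment.

    judgement is the annotator's own view of a new term and is assumed
    to be one of "legal", "unsure" or "non-legal".
    """
    # Normalise case and whitespace before the term bank lookup.
    key = " ".join(candidate.lower().split())
    if key in term_bank:
        # Known terms are always included: the term bank is the working standard.
        return "include"
    if judgement in ("legal", "unsure"):
        # New or doubtful terms are kept so the term bank manager can decide.
        return "include_as_candidate"
    return "omit"


# Hypothetical term bank entries, invented for illustration only.
term_bank = {"severance payment", "respondent"}

print(markup_decision("Severance  Payment", term_bank, "non-legal"))
print(markup_decision("leave to appeal", term_bank, "unsure"))
```

Checking the term bank first means that an annotator's momentary judgement cannot exclude a term that has already been accepted, which is how the lookup supports consistency.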
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Granularity
</SectionTitle>
      <Paragraph position="0"> Term granularity is another major issue not only for machines but also for humans. The terminology list should be as simple and compact as possible to avoid redundancy of entries. For example, instead of having "allegations", "corruption", "allegations of corruption", "allegations of manslaughter" as separate entries, it is preferable to treat only "allegations", "corruption" and "manslaughter" as glossary entries. The annotators have adopted the principle that a term should be a minimal semantic unit.</Paragraph>
      <Paragraph position="1"> Here "semantic unit" refers to single- or multi-word terms that have acquired specialized meaning or usage. For example, the phrase "great and general importance" GF9G55GD6G16G11GF9GDEGA4 has been used GF9G05GF8G04G08GF8G01G07GFFG0C as a frozen chunk, and should not be further divided into "great", "and", "general" and "importance". Similarly, "oral decision" G51G70G12G32 refers to the verbal delivery of judgments in trial as opposed to written judgments. Such decisions require real-world knowledge and sophisticated semantic/pragmatic interpretation and are not easily modelled by the computer.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Further Work
</SectionTitle>
    <Paragraph position="0"> Bilingual terminology extracted directly from the bilingual corpora retains the surface form in which it appears in the texts. English words with different morphological markers will therefore give rise to multiple entries in the resulting glossary. However, from the user's point of view, verbs with the same root but different inflectional markers (e.g. "hold", "held", "holding") should be combined to form one single entry. Similarly, variants of Chinese expressions that differ only by an optional marker G31 de (see Table 7) may better be treated as the same item to minimize redundancy.</Paragraph>
    <Paragraph position="1">  Term bank management tools will be developed to process the morphological markers and combine related pairs.</Paragraph>
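One way such a merging step might look is sketched below in Python. The lookup table `ROOTS`, the function `merge_variants` and the placeholder strings standing in for Chinese renditions are hypothetical; a real tool would presumably rely on a proper morphological analyser, and the handling of the optional Chinese marker is not attempted here.

```python
from collections import defaultdict

# Hypothetical lookup from inflected form to root, for illustration only.
ROOTS = {"held": "hold", "holding": "hold", "holds": "hold"}


def merge_variants(pairs, roots=ROOTS):
    """Group (english, chinese) pairs under the root of the English side."""
    merged = defaultdict(set)
    for english, chinese in pairs:
        # Forms missing from the lookup are kept as their own entry.
        merged[roots.get(english, english)].add(chinese)
    return dict(merged)


# Placeholder renditions ("zh_1" etc.) stand in for Chinese strings.
pairs = [
    ("hold", "zh_1"),
    ("held", "zh_1"),
    ("holding", "zh_2"),
    ("appeal", "zh_3"),
]

print(merge_variants(pairs))
```

Keeping the renditions as a set under each root means that the combined entry still records every distinct Chinese manifestation found in the corpus.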
  </Section>
</Paper>