<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0103">
  <Title>Mining Atomic Chinese Abbreviation Pairs: A Probabilistic Model for Single Character Word Recovery</Title>
  <Section position="3" start_page="0" end_page="19" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Chinese abbreviations are widely used in modern Chinese texts. They are a special form of unknown words, which cannot be exhaustively enumerated in an ordinary dictionary. Many of them originate from important lexical units such as named entities. However, the sources of Chinese abbreviations are not limited to the noun class; they come from most major categories, including verbs, adjectives, adverbs and others. No matter what lexical or syntactic structure a string of characters has, one can almost always find a way to abbreviate it into a shorter form. Therefore, it may be necessary to handle them beyond a class-based model. Furthermore, abbreviated words are semantically ambiguous. For example, &quot;Qing Da&quot; can be the abbreviation for &quot;Qing Hua Da Xue&quot; or &quot;Qing Jie Da Dui&quot;; in the opposite direction, multiple choices for abbreviating a word are also possible.</Paragraph>
    <Paragraph position="1"> For instance, &quot;Tai Bei Da Xue&quot; may be abbreviated as &quot;Tai Da&quot;, &quot;Bei Da&quot; or &quot;Tai Bei Da&quot;. This ambiguity creates difficulties for Chinese language processing and its applications, including word segmentation, information retrieval, query expansion, lexical translation and more. An abbreviation model or a large abbreviation lexicon is therefore highly desirable for Chinese language processing.</Paragraph>
    <Paragraph position="2"> Since the smallest possible Chinese lexical unit into which other words can be abbreviated is a single character, identifying the set of multi-character words which can be abbreviated into a single character is especially interesting.</Paragraph>
    <Paragraph position="3"> Actually, the abbreviation of a compound word can often be acquired by the principle of composition. In other words, one can decompose a compound word into its constituents and then concatenate their single character equivalents to form its abbreviated form. The reverse process to predict the unabbreviated form from an abbreviation shares the same compositional property.</Paragraph>
    <Paragraph position="4"> The Chinese abbreviation problem can be regarded as an error recovery problem in which the suspect root words are the &quot;errors&quot; to be recovered from a set of candidates. Such a problem can be mapped to an HMM-based generation model for both abbreviation identification and root word recovery; it can also be integrated as part of a unified word segmentation model when the input extends to a complete sentence. As such, we can find the most likely root words by finding those candidates that maximize the likelihood of the whole text. An abbreviation lexicon, which consists of the root-abbreviation pairs, can thus be constructed automatically.</Paragraph>
    <Paragraph position="5"> In a preliminary study (Chang and Lai, 2004), some probabilistic models were developed to handle this problem by applying the models to a parallel corpus of compound words and their abbreviations, without knowing the context of the abbreviation pairs. In this work, the same framework is extended and a method is proposed to automatically acquire a large abbreviation lexicon for individual characters from web texts or large corpora, instead of building abbreviation models based on aligned abbreviation pairs of short compound words. Unlike the previous task, which trains the abbreviation model parameters from a list of known abbreviation pairs, the current work aims at extracting abbreviation pairs from a corpus of free text, in which the locations of prospective abbreviations and full forms are unknown and the correspondence between them is not known either.</Paragraph>
    <Paragraph position="6"> In particular, a Single Character Recovery (SCR) Model is exploited in the current work to extract &quot;atomic abbreviation pairs&quot; from a large text corpus. With only a few training iterations, the acquisition achieves 62% and 50% precision on the training set and test set, respectively, from the ASWSC-2001 corpus.</Paragraph>
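The recovery idea described above can be illustrated with a minimal sketch. It is not the model of this paper: every probability below is a hypothetical value chosen for illustration, the table names `OUTPUT_PROB` and `TRANSITION_PROB` are assumptions, and the search enumerates all candidate sequences exhaustively instead of using a dynamic-programming decoder. It only shows how the most likely root words could be selected by maximizing the product of transition and output probabilities, using the ambiguous abbreviation "Qing Da" from the text.

```python
from itertools import product
from math import log

# Hypothetical output probabilities P(abbreviation character | root word).
OUTPUT_PROB = {
    ("Qing", "Qing Hua"): 0.6,
    ("Qing", "Qing Jie"): 0.4,
    ("Da", "Da Xue"): 0.7,
    ("Da", "Da Dui"): 0.3,
}

# Hypothetical transition probabilities P(root word | previous root word).
TRANSITION_PROB = {
    ("START", "Qing Hua"): 0.5,
    ("START", "Qing Jie"): 0.5,
    ("Qing Hua", "Da Xue"): 0.9,
    ("Qing Hua", "Da Dui"): 0.1,
    ("Qing Jie", "Da Xue"): 0.2,
    ("Qing Jie", "Da Dui"): 0.8,
}


def candidates_for(char):
    """Return every root word that the toy model can abbreviate into `char`."""
    return sorted(root for (c, root) in OUTPUT_PROB if c == char)


def log_score(chars, roots):
    """Log-likelihood of generating `chars` from the root sequence `roots`."""
    total = 0.0
    previous = "START"
    for char, root in zip(chars, roots):
        total += log(TRANSITION_PROB[(previous, root)])
        total += log(OUTPUT_PROB[(char, root)])
        previous = root
    return total


def recover(chars):
    """Pick the root sequence with the highest likelihood (exhaustive search)."""
    options = [candidates_for(char) for char in chars]
    return max(product(*options), key=lambda roots: log_score(chars, roots))


print(recover(["Qing", "Da"]))  # ('Qing Hua', 'Da Xue') under these toy numbers
```

With different hypothetical probabilities the same input would resolve to "Qing Jie Da Dui", which is the point of scoring candidates in context.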
    <Section position="1" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
1.1 Chinese Abbreviation Problems
</SectionTitle>
      <Paragraph position="0"> The modern Chinese language is a highly abbreviated one due to the mixed use of ancient single-character words as well as modern multi-character words and compound words.</Paragraph>
      <Paragraph position="1"> The abbreviated form and root form are used interchangeably throughout current Chinese articles. In some news articles, as many as 20% of the sentences contain suspect abbreviated words (Lai, 2003).</Paragraph>
      <Paragraph position="2"> Since abbreviations cannot be enumerated in a dictionary, they form a special class of unknown words, many of which originate from named entities. Many other open class words are also abbreviatable. This particular class thus introduces complications for Chinese language processing, including the fundamental word segmentation process (Chiang et al., 1992; Lin et al., 1993; Chang and Su, 1997) and many word-based applications. For instance, a keyword-based information retrieval system may require both forms, such as &quot;Zhong Yan Yuan&quot; and &quot;Zhong Yang Yan Jiu Yuan&quot; (&quot;Academia Sinica&quot;), in order not to miss any relevant documents. The Chinese word segmentation process is also significantly degraded by the existence of unknown words (Chiang et al., 1992), including unknown abbreviations.</Paragraph>
      <Paragraph position="3"> There are some heuristics for Chinese abbreviations. Such heuristics, however, can easily break (Sproat, 2002). Unlike English abbreviations, the abbreviation process of the Chinese language is a very special word formation process. Almost all characters in all positions of a word can be omitted when used for forming an abbreviation of a compound word.</Paragraph>
      <Paragraph position="4"> For instance, it seems that, by common heuristics, &quot;most&quot; Chinese abbreviations could be derived by keeping the first characters of the constituent words of a compound word, such as transforming 'Tai Wan Da Xue' into 'Tai Da', 'Qing Hua Da Xue' into 'Qing Da' and 'Yi Se Lie (Ji) Ba Le Si Tan' into 'Yi Ba'. Unfortunately, this is not always the case. For example, we can transform 'Tai Wan Xiang Gang' into 'Tai Gang', 'Zhong Guo Shi You' into 'Zhong You', and, for very long compounds, 'Yun Lin Jia Yi Tai Nan' into 'Yun Jia Nan' (Sproat, 2002). Therefore, it is very difficult to predict the possible surface forms of Chinese abbreviations and to guess their base (non-abbreviated) forms heuristically.</Paragraph>
      <Paragraph position="5"> The high-frequency abbreviation patterns reported in (Chang and Lai, 2004) further break the heuristics quantitatively. Table 1 lists the distribution of the most frequent abbreviation patterns for words of length 2 to 8 characters.</Paragraph>
      <Paragraph position="6"> The table indicates which characters will be deleted from the root of a particular length (n) with a bit '0'; on the other hand, a bit '1' means that the respective character will be retained.</Paragraph>
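As a minimal sketch of this bit notation, the helper below applies a retain/delete pattern to a root word. The function name `apply_pattern` and the encoding of a pattern as a string of '0' and '1' characters are assumptions made for illustration; the two examples reuse root and abbreviation pairs quoted earlier in this section.

```python
def apply_pattern(root_chars, pattern):
    """Keep the characters whose bit is '1' and delete those whose bit is '0'."""
    if len(root_chars) != len(pattern):
        raise ValueError("pattern length must equal the root length n")
    return [char for char, bit in zip(root_chars, pattern) if bit == "1"]


# Root of length n = 4, retaining the first and third characters.
print(apply_pattern(["Tai", "Wan", "Da", "Xue"], "1010"))  # ['Tai', 'Da']

# Root of length n = 6, retaining the first, third and sixth characters.
print(apply_pattern(["Yun", "Lin", "Jia", "Yi", "Tai", "Nan"], "101001"))  # ['Yun', 'Jia', 'Nan']
```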
      <Paragraph position="7"> This table does quantitatively support some general heuristics of native Chinese speakers. For instance, there is strong support for the heuristics that the first character in a two-character word will be retained in most cases, and that the first and the third characters in a 4-character word will be retained in 56% of the cases. However, the table also shows that around 50% of the cases cannot be uniquely determined by character position simply by consulting the word length of the unabbreviated form. This suggests the need for either an abbreviation model or a large abbreviation lexicon for resolving this kind of unknown word and named entity.</Paragraph>
      <Paragraph position="8"> There is also a large percentage (312/1547) of &quot;tough&quot; abbreviation patterns (Chang and Lai, 2004), which are considered &quot;tough&quot; in the sense that they violate some simple assumptions, and thus cannot be modeled in a simple way. For instance, some tough words are recursively abbreviated into shorter and shorter lexical forms, and others may change the word order (as in abbreviating &quot;Di [?] He Neng Fa Dian Chang&quot; as &quot;He [?] Chang&quot; instead of &quot;[?] He Chang&quot;). As a result, the abbreviation process is much more complicated than a native Chinese speaker might think.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
1.2 Atomic Abbreviation Pairs
</SectionTitle>
      <Paragraph position="0"> Since abbreviated words are created continuously through the abbreviation of new (mostly compound) words, it is nearly impossible to construct a complete abbreviation lexicon. In spite of the difficulty, it is interesting to note that the abbreviation process for Chinese compound words seems to be &quot;compositional&quot;. In other words, one can often decode an abbreviated word, such as &quot;Tai Da&quot; (&quot;Taiwan University&quot;), character-by-character back to its root form &quot;Tai Wan Da Xue&quot; by observing that &quot;Tai&quot; can be an abbreviation of &quot;Tai Wan&quot;, that &quot;Da&quot; can be an abbreviation of &quot;Da Xue&quot;, and that &quot;Tai Wan Da Xue&quot; is a frequently observed character sequence.</Paragraph>
      <Paragraph position="1"> Since the character is the smallest lexical unit in Chinese, no further abbreviation into smaller units is possible. We therefore use &quot;atomic abbreviation pair&quot; to refer to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character.</Paragraph>
      <Paragraph position="2"> On the other hand, abbreviations of multi-character compound words may be synthesized from single characters in the &quot;atomic abbreviation pairs&quot;. If we are able to identify all such &quot;atomic abbreviation pairs&quot;, where the abbreviation is a single character, and construct such an atomic abbreviation lexicon, then resolving multiple character abbreviation problems, either by heuristics or by other abbreviation models, might become easier.</Paragraph>
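The character-by-character decoding just described can be sketched as follows. This is an illustrative toy, not the SCR model: the lexicon `ATOMIC_PAIRS` is a hypothetical three-entry table built from pairs mentioned in this introduction, and the function names are assumptions. The sketch only enumerates candidate root forms; choosing among several candidates would still require corpus statistics such as the frequency of the resulting character sequence.

```python
from itertools import product

# Hypothetical toy lexicon: root word -> single-character abbreviation.
ATOMIC_PAIRS = {
    "Tai Wan": "Tai",
    "Qing Hua": "Qing",
    "Da Xue": "Da",
}


def abbreviate(constituents):
    """Concatenate the single-character equivalent of each constituent word."""
    return " ".join(ATOMIC_PAIRS[word] for word in constituents)


def expand(abbreviation):
    """Enumerate candidate root forms by expanding each character independently."""
    roots_by_char = {}
    for root, char in ATOMIC_PAIRS.items():
        roots_by_char.setdefault(char, []).append(root)
    options = [roots_by_char[char] for char in abbreviation.split()]
    return [" ".join(choice) for choice in product(*options)]


print(abbreviate(["Tai Wan", "Da Xue"]))  # Tai Da
print(expand("Tai Da"))                   # ['Tai Wan Da Xue']
```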
      <Paragraph position="3"> Furthermore, many ancient Chinese articles are composed mostly of single-character words.</Paragraph>
      <Paragraph position="4"> The higher the percentage of such single-character words in a modern Chinese article, the more the article resembles an ancient Chinese article. As another application, an effective single character recovery model may therefore be adapted into an auxiliary system for translating ancient Chinese articles into their modern versions. This is, of course, an overly bold claim since lexical translation is not the only factor for such an application. However, it may be considered a possible direction for lexical translation when constructing an ancient-to-modern article translation system.</Paragraph>
      <Paragraph position="5"> Also, when a model for recovering atomic translation pairs is applied to the &quot;single character regions&quot; of a word-segmented corpus, it is likely to recover unknown abbreviated words that were previously segmented incorrectly into individual characters.</Paragraph>
      <Paragraph position="6"> An HMM-based Single Character Recovery (SCR) Model is therefore proposed in this paper to extract a large set of &quot;atomic abbreviation pairs&quot; from a large text corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
1.3 Previous Works
</SectionTitle>
      <Paragraph position="0"> Currently, only a few quantitative approaches (Huang et al., 1994a; Huang et al., 1994b) are available for predicting the presence of an abbreviation. There is essentially no prior work on automatically extracting atomic abbreviation pairs. Since such formulations regard the word segmentation process and abbreviation identification as two independent processes, they probably cannot optimize the identification process jointly with the word segmentation process, and thus may lose some useful contextual information. Some class-based segmentation models (Sun et al., 2002; Gao et al., 2003) integrate well the identification of some regular non-lexicalized units (such as named entities). However, the abbreviation process can be applied to almost all word forms (or classes of words). Therefore, this particular word formation process may have to be handled as a separate layer in the segmentation process. To resolve the Chinese abbreviation problems and integrate their identification into the word segmentation process, Chang and Lai (2004) propose to regard the abbreviation problem in the word segmentation process as an &quot;error recovery&quot; problem in which the suspect root words are the &quot;errors&quot; to be recovered from a set of candidates according to some generation probability criteria. This idea implies that an HMM-based model for identifying Chinese abbreviations could be effective both in identifying the existence of an abbreviation and in recovering the root words from an abbreviation.</Paragraph>
      <Paragraph position="1"> Since the parameters of an HMM-like model can usually be trained in an unsupervised manner, and the &quot;output probabilities&quot; of the HMM framework indicate the likelihood that an abbreviation is generated from a root candidate, such a formulation can easily be adapted to collect highly probable root-abbreviation pairs. As a side effect of using an HMM-based formulation, we expect that a large abbreviation dictionary could be derived automatically from a large corpus or from web documents through the training process of the unified word segmentation model.</Paragraph>
      <Paragraph position="2"> In this work, we therefore explore the possibility of using the theories in (Chang and Lai, 2004) as a framework for constructing a large abbreviation lexicon consisting of all Chinese characters and their potential roots. In the following section, the HMM models outlined in (Chang and Lai, 2004) are reviewed first. We then describe how to use this framework to construct an abbreviation lexicon automatically. In particular, a Single Character Recovery (SCR) Model is exploited for extracting possible root (unabbreviated) forms for all Chinese characters.</Paragraph>
    </Section>
  </Section>
</Paper>