<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1125">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Phonetic-Based Approach to Chinese Chat Text Normalization</Title>
  <Section position="5" start_page="994" end_page="994" type="metho">
    <SectionTitle>
3 Source Channel Model and Problems
</SectionTitle>
    <Paragraph position="0"> The source channel model is implemented as the baseline method for chat term normalization in this work. We briefly describe its methodology and problems below.</Paragraph>
    <Section position="1" start_page="994" end_page="994" type="sub_section">
      <SectionTitle>
3.1 The Model
</SectionTitle>
      <Paragraph position="0"> The source channel model (SCM) is a successful statistical approach in speech recognition and machine translation (Brown, 1990). SCM is deemed applicable to chat term normalization because the two tasks are similar in nature. In our case, given input chat text T, SCM aims to find the character string C that maximizes the conditional probability p(C|T).</Paragraph>
      <Paragraph position="2"> Since p(C|T) is proportional to p(T|C)p(C), the C that maximizes p(C|T) also maximizes p(T|C)p(C). Now p(C|T) is decomposed into two components, i.e. the chat term translation observation model p(T|C) and the language model p(C). Both models can be estimated with the maximum likelihood method using the trigram model on the NIL corpus.</Paragraph>
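The decomposition above can be sketched as a toy noisy-channel decoder. The function name and the probability tables here are illustrative assumptions, not the paper's trained models: the candidate maximizing p(T|C)p(C) is selected in log space.

```python
import math

def scm_decode(chat_term, translation_probs, lm_probs):
    """Pick the standard candidate C maximizing p(T|C)p(C) in log space.
    Both tables are toy stand-ins for the trained SCM models."""
    best, best_score = None, -math.inf
    for (t, c), p_t_given_c in translation_probs.items():
        if t != chat_term:
            continue
        score = math.log(p_t_given_c) + math.log(lm_probs.get(c, 1e-9))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy tables: the chat character "Jie" may stand for "Zhe" or for itself.
translation = {("Jie", "Zhe"): 0.56, ("Jie", "Jie"): 0.30}
lm = {"Zhe": 0.02, "Jie": 0.001}
print(scm_decode("Jie", translation, lm))
```

With these toy numbers the language model favors "Zhe", so the decoder normalizes "Jie" to "Zhe".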
    </Section>
    <Section position="2" start_page="994" end_page="994" type="sub_section">
      <SectionTitle>
3.2 Problems
</SectionTitle>
      <Paragraph position="0"> Two problems are notable when applying SCM to chat term normalization. First, data sparseness is serious: because of the dynamic nature of chat language, a timely chat language corpus is expensive to build and therefore small. The NIL corpus contains only 12,112 pieces of chat text created over eight months, which is far from sufficient to train the chat term translation model. Second, training effectiveness is poor, again due to this dynamic nature.</Paragraph>
      <Paragraph position="1"> Trained on static pieces of chat text, the SCM approach would perform poorly on future chat text. Robustness on dynamic chat text thus becomes a challenging issue in our research. Constantly updating the corpus with recent chat text is obviously not a good solution to the above problems. We need information beyond the character level to help address the data sparseness and dynamic problems. Fortunately, observation of chat terms provides convincing evidence that underlying phonetic mappings exist between most chat terms and their standard counterparts. These phonetic mappings are promising for resolving both problems.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="994" end_page="996" type="metho">
    <SectionTitle>
4 Phonetic Mapping Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="994" end_page="994" type="sub_section">
      <SectionTitle>
4.1 Definition of Phonetic Mapping
</SectionTitle>
      <Paragraph position="0"> Phonetic mapping is the bridge that connects two Chinese characters via phonetic transcription, i.e.</Paragraph>
      <Paragraph position="1"> Chinese Pinyin in our case. For example, "Jie - Zhe (jie, zhe, 0.56)" is the phonetic mapping connecting "Zhe (this, zhe4)" and "Jie (interrupt, jie4)", in which "zhe" and "jie" are the Chinese Pinyin for "Zhe" and "Jie" respectively, and 0.56 is the phonetic similarity between the two Chinese characters.</Paragraph>
      <Paragraph position="2"> Technically, phonetic mappings can be constructed between any two Chinese characters within any Chinese corpus. In chat language, any Chinese character can be used in a chat term, and phonetic mappings connect chat terms to their standard counterparts. Unlike the dynamic character mappings, the phonetic mappings can be produced beforehand from a standard Chinese corpus. They are thus stable over time.</Paragraph>
    </Section>
    <Section position="2" start_page="994" end_page="995" type="sub_section">
      <SectionTitle>
4.2 Justifications on Phonetic Assumption
</SectionTitle>
      <Paragraph position="0"> To make use of phonetic mappings in the normalization of chat language terms, we must assume that chat terms are mainly formed via phonetic mappings. To justify this assumption, two questions must be answered. First, what percentage of chat terms is created via phonetic mappings? Second, why are the phonetic mapping models more stable than the character mapping models in chat language? [Table 2: chat term distribution by mapping type.]</Paragraph>
      <Paragraph position="1"> To answer the first question, we examine the chat term distribution by mapping type in Table 2. It reveals that 99.2 percent of the chat terms in the NIL corpus fall into the first four mapping types, which make use of phonetic mappings. In other words, 99.2 percent of chat terms can be represented by phonetic mappings. The remaining 0.8 percent of chat terms come from the OTHER type, for instance emoticons. These statistics clearly answer the first question. To answer the second question, an observation is conducted again on the five chat term sets described in Section 2.2. We manually create phonetic mappings for the 500 chat terms in each set, obtaining five phonetic mapping sets. These are in turn compared against the standard phonetic mapping set constructed from the Chinese Gigaword. The percentage of phonetic mappings in each set covered by the standard set is presented in Table 3.</Paragraph>
      <Paragraph position="2"> [Table 3: percentage of phonetic mappings in each set covered by the standard set.]</Paragraph>
      <Paragraph position="3"> By comparing Table 1 and Table 3, we find that phonetic mappings remain more stable than character mappings in chat language text. This finding justifies our intention to design an effective and robust chat language normalization method by introducing phonetic mappings into the source channel model. Note that the roughly 1% loss in these percentages comes from chat terms that are not formed via phonetic mappings, for example emoticons.</Paragraph>
    </Section>
    <Section position="3" start_page="995" end_page="995" type="sub_section">
      <SectionTitle>
4.3 Formalism
</SectionTitle>
      <Paragraph position="0"> The phonetic mapping model is a five-tuple, i.e.</Paragraph>
      <Paragraph position="2"> (T, C, pt(T), pt(C), Pr_pm(T|C)), which comprises the chat term character T, the standard counterpart character C, the phonetic transcriptions of T and C, i.e. pt(T) and pt(C), and the mapping probability Pr_pm(T|C) that T is mapped to C via this phonetic mapping. Because they cover mappings between any two Chinese characters, the phonetic mapping models should be constructed from a standard language corpus. This yields two advantages. First, the data sparseness problem can be addressed appropriately because a standard language corpus is used. Second, the phonetic mapping models are as stable as the standard language itself. In chat term normalization, when the phonetic mapping models are used to represent mappings between chat term characters and their standard counterpart characters, the dynamic problem can be addressed in a robust manner. In contrast, the character mapping model used in the SCM (see Section 3.1) connects two Chinese characters directly. It is a three-tuple (T, C, Pr_cm(T|C)), where Pr_cm(T|C) is the probability that T is mapped to C via this character mapping. Because they must be constructed from chat language training samples, the character mapping models suffer from both the data sparseness problem and the dynamic problem.</Paragraph>
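The two formalisms can be sketched as simple records. The class and field names below are illustrative, not the paper's code; they only mirror the five-tuple and three-tuple structures described above.

```python
from dataclasses import dataclass

@dataclass
class PhoneticMapping:
    """Five-tuple built from a standard language corpus."""
    t: str          # chat term character T
    c: str          # standard counterpart character C
    pt_t: str       # phonetic transcription pt(T)
    pt_c: str       # phonetic transcription pt(C)
    prob: float     # mapping probability Pr_pm(T|C)

@dataclass
class CharacterMapping:
    """Three-tuple trained on chat language samples."""
    t: str          # chat term character T
    c: str          # standard counterpart character C
    prob: float     # mapping probability Pr_cm(T|C)

# The running example from Section 4.1, expressed as a phonetic mapping.
pm = PhoneticMapping("Jie", "Zhe", "jie", "zhe", 0.56)
print(pm.prob)
```

The extra two fields are exactly what lets the phonetic model be estimated from stable standard-language data rather than scarce chat text.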
    </Section>
    <Section position="4" start_page="995" end_page="996" type="sub_section">
      <SectionTitle>
4.4 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Two questions should be answered in parameter estimation. First, how is the phonetic mapping space constructed? Second, how are the phonetic mapping probabilities estimated? To construct the phonetic mapping models, we first extract all Chinese characters from the standard Chinese corpus and use them to form candidate character mapping models. We then generate phonetic transcriptions for the Chinese characters and calculate a phonetic probability for each candidate character mapping model, excluding those with zero probability. Finally, the character mapping models are converted to phonetic mapping models by incorporating the phonetic transcriptions and phonetic probabilities.</Paragraph>
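The four construction steps above can be sketched as follows. The helper arguments `pinyin` and `phonetic_prob` are assumed placeholders for the transcription and scoring procedures, not functions from the paper.

```python
def build_phonetic_mappings(corpus_chars, pinyin, phonetic_prob):
    """1. pair up characters as candidate mappings; 2./3. score each pair
    with a phonetic probability and drop zero-probability pairs; 4. attach
    the phonetic transcriptions to form the final phonetic mapping models."""
    mappings = []
    for t in corpus_chars:
        for c in corpus_chars:
            p = phonetic_prob(t, c)
            if p > 0.0:  # exclude zero-probability candidates
                mappings.append((t, c, pinyin(t), pinyin(c), p))
    return mappings

# Toy run with an identity "transcription" and an exact-match "probability".
demo = build_phonetic_mappings(
    ["a", "b"],
    pinyin=lambda ch: ch,
    phonetic_prob=lambda x, y: 1.0 if x == y else 0.0,
)
print(len(demo))
```

The zero-probability filter is what keeps the mapping space tractable: only phonetically plausible character pairs survive.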
      <Paragraph position="1"> The phonetic probability is calculated by combining phonetic similarity and character frequencies in standard language as follows.</Paragraph>
      <Paragraph position="3"> where A is the character set in the standard language corpus.</Paragraph>
      <Paragraph position="5"> Each character A' in this set is similar to the character A in terms of phonetic transcription. fr_slc(c) is a function returning the frequency of a given character c in the standard language corpus, and initial(x) and final(x) return the initial (shengmu) and final (yunmu) of a given Chinese Pinyin x respectively. For example, the Chinese Pinyin for the Chinese character "Zhe" is "zhe", in which "zh" is the initial and "e" is the final. When the initial or final is empty for some Chinese character, we calculate the similarity of the existing parts only. An algorithm for calculating the similarity of initial pairs and final pairs is proposed in (Li et al., 2003) based on letter matching. The problem with this algorithm is that it always assigns zero similarity to pairs containing no common letter. For example, the initial similarity between "ch" and "q" is set to zero, while in fact the two initials are pronounced very similarly in Chinese speech. So non-zero similarity values should be assigned to these special pairs beforehand (e.g., the similarity between "ch" and "q" is set to 0.8). The similarity values are agreed upon by native Chinese speakers.</Paragraph>
      <Paragraph position="6"> Thus Li et al.'s algorithm is extended to output a pre-defined similarity value before letter matching is executed. For example, the Pinyin similarity between "chi" and "qi" is calculated as follows.</Paragraph>
      <Paragraph position="7"> Sim(chi, qi) = Sim(ch, q) x Sim(i, i) = 0.8 x 1 = 0.8</Paragraph>
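The extended similarity computation can be sketched as follows. The special-pair table and the letter-matching fallback are hedged reconstructions of the behavior described above, not the exact algorithm of Li et al. (2003).

```python
# Hand-set similarities for special initial pairs with no common letter.
SPECIAL = {("ch", "q"): 0.8, ("q", "ch"): 0.8}

def part_sim(x, y):
    """Similarity of two initials or two finals: exact match first, then
    the pre-defined special-pair table, then a naive letter-matching
    fallback (the fallback formula is an assumption)."""
    if x == y:
        return 1.0
    if (x, y) in SPECIAL:
        return SPECIAL[(x, y)]
    shared = len(set(x).intersection(set(y)))
    return shared / max(len(x), len(y))

def pinyin_sim(initial1, final1, initial2, final2):
    # e.g. Sim(chi, qi) = Sim(ch, q) x Sim(i, i) = 0.8 x 1 = 0.8
    return part_sim(initial1, initial2) * part_sim(final1, final2)

print(pinyin_sim("ch", "i", "q", "i"))
```

Checking the special table before letter matching is exactly what rescues pairs like "ch"/"q" that share no letter yet sound alike.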
    </Section>
  </Section>
  <Section position="7" start_page="996" end_page="996" type="metho">
    <SectionTitle>
5 Extended Source Channel Model
</SectionTitle>
    <Paragraph position="0"> We extend the source channel model by inserting phonetic mapping models M between the chat terms T and their standard counterparts C.</Paragraph>
    <Paragraph position="2"> Thus three components are involved in XSCM, i.e.</Paragraph>
    <Paragraph position="3"> the chat term normalization observation model p(T|M,C), the phonetic mapping model p(M|C), and the language model p(C).</Paragraph>
    <Section position="1" start_page="996" end_page="996" type="sub_section">
      <SectionTitle>
Chat Term Normalization Observation
</SectionTitle>
      <Paragraph position="0"> Model. We assume that the mappings between chat terms and their standard Chinese counterparts are independent of each other, so the chat term normalization probability can be calculated as follows.</Paragraph>
      <Paragraph position="2"> That is, p(T|M,C) is the product of the per-character probabilities p(t_i|m_i,c_i), which are estimated using the maximum likelihood estimation method with the Chinese character trigram model on the NIL corpus.</Paragraph>
      <Paragraph position="3"> Phonetic Mapping Model. We assume that the phonetic mapping models depend merely on the current observation. Thus the phonetic mapping probability is calculated as follows.</Paragraph>
      <Paragraph position="5"> That is, p(M|C) is the product of the per-character probabilities p(m_i|c_i), which are estimated with equations (2) and (3) using a standard Chinese corpus. Language Model. The language model p(C) is estimated using the maximum likelihood estimation method with the Chinese character trigram model on the NIL corpus.</Paragraph>
      <Paragraph position="6"> In our implementation, the Katz backoff smoothing technique (Katz, 1987) is used to handle the data sparseness problem, and the Viterbi algorithm is employed to find the optimal solution in XSCM.</Paragraph>
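Viterbi decoding over the XSCM decomposition can be sketched as follows. A bigram language model stands in for the paper's trigram model, and all probability tables are toy values chosen for illustration only.

```python
import math

def logp(p):
    # clamp to avoid log(0) for unseen events
    return math.log(max(p, 1e-12))

def xscm_viterbi(chat_chars, candidates, obs, mapping, lm):
    """candidates: chat char to list of standard chars; obs[(t, c)] plays
    the role of p(t|m,c); mapping[(t, c)] plays p(m|c); lm[(prev, c)] is a
    bigram stand-in for the trigram language model p(C)."""
    prev = {"BOS": (0.0, [])}
    for t in chat_chars:
        cur = {}
        for c in candidates[t]:
            emit = logp(obs.get((t, c), 0.0)) + logp(mapping.get((t, c), 0.0))
            cur[c] = max(
                (s + logp(lm.get((p, c), 0.0)) + emit, path + [c])
                for p, (s, path) in prev.items()
            )
        prev = cur
    return max(prev.values())[1]

# Toy disambiguation: decoded jointly, "Ba Ming" can beat "8 Mi".
cands = {"8": ["8", "Ba"], "Mi": ["Mi", "Ming"]}
obs = {("8", "8"): 0.5, ("8", "Ba"): 0.5, ("Mi", "Mi"): 0.4, ("Mi", "Ming"): 0.6}
mp = {("8", "8"): 1.0, ("8", "Ba"): 0.7, ("Mi", "Mi"): 1.0, ("Mi", "Ming"): 0.6}
lm = {("BOS", "8"): 0.1, ("BOS", "Ba"): 0.2, ("8", "Mi"): 0.1, ("8", "Ming"): 0.05,
      ("Ba", "Mi"): 0.01, ("Ba", "Ming"): 0.3}
print(xscm_viterbi(["8", "Mi"], cands, obs, mp, lm))
```

The dynamic program keeps one best-scoring path per candidate character at each position, so the cost is linear in sentence length rather than exponential.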
    </Section>
  </Section>
  <Section position="8" start_page="996" end_page="998" type="metho">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="996" end_page="996" type="sub_section">
      <SectionTitle>
6.1 Data Description
Training Sets
</SectionTitle>
      <Paragraph position="0"> Two types of training data are used in our experiments. We use the Xinhua News Agency portion of the LDC Chinese Gigaword v.2 (CNGIGA) (Graf et al., 2005) as the standard Chinese corpus for constructing phonetic mapping models, because of its excellent coverage of standard Simplified Chinese. We use the NIL corpus (Xia et al., 2006b) as the chat language corpus. To evaluate our methods on size-varying training data, six chat language corpora are created from the NIL corpus. We randomly select 6,056 sentences from the NIL corpus to form the first chat language corpus, C#1, and add a further 1,211 random sentences for each subsequent corpus, so that 7,267 sentences are contained in C#2, 8,478 in C#3, 9,689 in C#4, 10,200 in C#5, and 12,113 in C#6.</Paragraph>
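The incremental construction of the six training corpora can be sketched as follows. The sizes follow the counts quoted above; the fixed random seed and the nested-prefix strategy are added assumptions for reproducibility, not details from the paper.

```python
import random

def make_corpora(nil_sentences, sizes=(6056, 7267, 8478, 9689, 10200, 12113)):
    """Shuffle once, then take nested prefixes so each corpus C#k
    extends the previous one by additional random sentences."""
    pool = list(nil_sentences)
    random.Random(0).shuffle(pool)  # fixed seed: an added assumption
    return [pool[:s] for s in sizes]

demo = make_corpora(["sent %d" % i for i in range(12113)])
print([len(c) for c in demo])
```

Nesting the corpora ensures that any quality difference between C#k and C#(k+1) comes from the added sentences alone.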
    </Section>
    <Section position="2" start_page="996" end_page="997" type="sub_section">
      <SectionTitle>
Test Sets
</SectionTitle>
      <Paragraph position="0"> Test sets are used to verify that chat language is dynamic and that XSCM is effective and robust in normalizing dynamic chat language terms. Six time-varying test sets, i.e. T#1 ~ T#6, are created in our experiments. They contain chat language sentences posted from August 2005 to January 2006.</Paragraph>
      <Paragraph position="1"> We randomly extract 1,000 chat language sentences posted in each month, so the timestamps of the six test sets are in temporal order: the timestamp of T#1 is the earliest and that of T#6 is the newest.</Paragraph>
      <Paragraph position="2"> The normalized sentences are created by hand and used as standard normalization answers.</Paragraph>
    </Section>
    <Section position="3" start_page="997" end_page="997" type="sub_section">
      <SectionTitle>
6.2 Evaluation Criteria
</SectionTitle>
      <Paragraph position="0"> We evaluate two tasks in our experiments, i.e.</Paragraph>
      <Paragraph position="1"> recognition and normalization. In recognition, we use precision (p), recall (r) and f-1 measure (f) defined as follows.</Paragraph>
      <Paragraph position="3"> p = x/(x+y), r = x/(x+z), f = 2pr/(p+r), where x denotes the number of true positives, y the number of false positives, and z the number of false negatives.</Paragraph>
      <Paragraph position="4"> For normalization, we use accuracy (a), which is commonly accepted by machine translation researchers as a standard evaluation criterion.</Paragraph>
      <Paragraph position="5"> Every output of the normalization methods is compared to the standard answer so that normalization accuracy on each test set is produced.</Paragraph>
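The criteria can be written out directly. The numbers in the example are a worked toy case, not scores from the paper.

```python
def prf(x, y, z):
    """x: true positives, y: false positives, z: false negatives."""
    p = x / (x + y)           # precision
    r = x / (x + z)           # recall
    f = 2 * p * r / (p + r)   # f-1 measure
    return p, r, f

def accuracy(outputs, answers):
    """Fraction of normalized sentences matching the hand-made answers."""
    hits = sum(1 for o, a in zip(outputs, answers) if o == a)
    return hits / len(answers)

print(prf(80, 20, 20))
```

With 80 true positives, 20 false positives, and 20 false negatives, precision, recall, and f-1 all come out to 0.8.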
    </Section>
    <Section position="4" start_page="997" end_page="998" type="sub_section">
      <SectionTitle>
6.3 Experiment I: SCM vs. XSCM Using
Size-varying Chat Language Corpora
</SectionTitle>
      <Paragraph position="0"> In this experiment we investigate the quality of XSCM and SCM using the same size-varying training data. We intend to show that chat language is dynamic and that the phonetic mapping models used in XSCM help address the dynamic problem. As no standard Chinese corpus is used in this experiment, we use the standard Chinese text in the chat language corpora to construct the phonetic mapping models in XSCM. This violates the basic assumption that the phonetic mapping models should be constructed from a standard Chinese corpus, so the results in this experiment should be used for comparison purposes only. It would be unfair to draw any conclusion on the general performance of the XSCM method from these results.</Paragraph>
      <Paragraph position="1"> We train the two methods with each of the six chat language corpora, i.e. C#1 ~ C#6, and test them on the six time-varying test sets, i.e. T#1 ~ T#6. The f-1 measure values produced by SCM and XSCM in this experiment are presented in Table 3.</Paragraph>
      <Paragraph position="2"> Three tendencies should be pointed out according to Table 3. The first tendency is that the f-1 measure of both methods drops on newer test sets (see Figure 1) given the same training chat language corpora. For example, both SCM and XSCM perform best on the earliest test set T#1 and worst on T#4. We find that the quality drop is caused by the dynamic nature of chat language, which confirms that chat language is indeed dynamic. We also find that the quality of XSCM drops less than that of SCM, which shows that the phonetic mapping models used in XSCM help address the dynamic problem.</Paragraph>
      <Paragraph position="3"> However, the quality of XSCM in this experiment still drops by 0.05 across the six time-varying test sets. This is because a chat language text corpus is used in place of a standard language corpus to model the phonetic mappings, and phonetic mapping models constructed from a chat language corpus are far from sufficient. Experiment II will show that stable phonetic mapping models can be constructed from a real standard language corpus, i.e. CNGIGA.</Paragraph>
      <Paragraph position="4"> The second tendency is that the f-1 measure of both methods on the same test sets drops when they are trained on smaller chat language corpora. For example, both SCM and XSCM perform best with the largest training corpus C#6 and worst with the smallest corpus C#1. This reveals that both methods favor a bigger training chat language corpus, so extending the chat language corpus is one way to improve the quality of chat language term normalization.</Paragraph>
      <Paragraph position="5"> The last tendency concerns the quality gap between SCM and XSCM. We calculate the f-1 measure gaps between the two methods using the same training sets on the same test sets (see Figure 2). The tendency is then clear: the gap between SCM and XSCM grows as the test set becomes newer. On the oldest test set T#1 the gap is smallest, while on the newest test set T#6 it reaches its biggest value, around 0.09.</Paragraph>
      <Paragraph position="6"> This tendency reveals excellent capability of XSCM in addressing dynamic problem using the phonetic mapping models.</Paragraph>
    </Section>
    <Section position="5" start_page="998" end_page="998" type="sub_section">
      <SectionTitle>
6.4 Experiment II: SCM vs. XSCM Using
Size-varying Chat Language Corpora
and CNGIGA
</SectionTitle>
      <Paragraph position="0"> In this experiment we investigate the quality of SCM and XSCM when a real standard Chinese language corpus is incorporated. We want to show that the dynamic problem can be addressed effectively and robustly when CNGIGA is used as the standard Chinese corpus.</Paragraph>
      <Paragraph position="1"> We train the two methods on CNGIGA and each of the six chat language corpora, i.e. C#1 ~ C#6. We then test the two methods on the six time-varying test sets, i.e. T#1 ~ T#6. The f-1 measure values produced by SCM and XSCM in this experiment are presented in Table 4.</Paragraph>
      <Paragraph position="2"> [Table 4: f-1 measures of SCM and XSCM on six test sets with six chat language corpora and CNGIGA.]</Paragraph>
      <Paragraph position="3"> Three observations are made on our results. First, according to Table 4, the f-1 measure of SCM with the same training chat language corpora drops on the time-varying test sets, but XSCM produces much better f-1 measures consistently using CNGIGA and the same training chat language corpora (see Figure 3). This shows that the phonetic mapping models are helpful in the XSCM method.</Paragraph>
      <Paragraph position="4"> The phonetic mapping models contribute in two aspects. On the one hand, they improve quality of chat term normalization on individual test sets. On the other hand, satisfactory robustness is achieved consistently.</Paragraph>
      <Paragraph position="5"> [Figure 3: f-1 measures on six test sets with six chat language corpora and CNGIGA.]</Paragraph>
      <Paragraph position="6"> The second observation concerns the phonetic mapping models constructed with CNGIGA. We find that 4,056,766 phonetic mapping models are constructed in this experiment, while only 1,303,227 models were constructed with the NIL corpus in Experiment I. This reveals that the coverage of the standard Chinese corpus is crucial to phonetic mapping modeling. We then compare the two character lists constructed from the two corpora: the 100 characters most frequently used in the NIL corpus are rather different from those extracted from CNGIGA. We conclude that phonetic mapping models should be constructed from a sound corpus that can represent the standard language.</Paragraph>
      <Paragraph position="7"> The last observation concerns the f-1 measures achieved by the same methods on the same test sets using size-varying training chat language corpora. Both methods produce their best f-1 measures with the biggest training corpus C#6 on the same test sets. This again shows that a bigger training chat language corpus can help improve the quality of chat language term normalization. One might ask whether the quality of XSCM converges with the size of the training chat language corpus; this question remains open due to the limited chat language corpus available to us.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="998" end_page="999" type="metho">
    <SectionTitle>
6.5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> Typical errors in our experiments belong mainly to the following two types.</Paragraph>
    <Paragraph position="1"> In this example, XSCM finds no chat term while the correct normalization answer is "Wo Huan Shi Bu Ming (I still don't understand)". The error illustrated in Example-1 occurs when the chat terms "8 (eight, ba1)" and "Mi (meter, mi3)" appear together in a chat sentence. In chat language, "Mi" is in some cases used to replace "Ming (understand, ming2)", while in other cases it represents a unit of length, i.e. meter. When the number "8" appears before "Mi", it is difficult to tell within sentential context whether they are chat terms. In our experiments, 93 similar errors occurred. We believe this type of error can be addressed with discoursal context.</Paragraph>
    <Paragraph position="2"> Err.2 Chat terms created in manners other than phonetic mapping. Example-2: You Lu ing. In this example, XSCM does not recognize "ing" while the correct answer is "(Zheng Zai) You Lu (I'm worrying)". This is because chat terms created in manners other than phonetic mapping are excluded by the phonetic assumption in the XSCM method. Around 1% of chat terms fall outside the phonetic mapping types. Besides chat terms of the same form as shown in Example-2, emoticons are another major exception type. Fortunately, a dictionary-based method is powerful enough to handle these exceptions, so in a real system they are handled by an extra component.</Paragraph>
  </Section>
</Paper>