<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1125"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics A Phonetic-Based Approach to Chinese Chat Text Normalization</Title> <Section position="4" start_page="993" end_page="994" type="intro"> <SectionTitle> 2 Feature Analysis and Evidences </SectionTitle> <Paragraph position="0"> Observation of the NIL corpus reveals two features of chat language: it is anomalous and it is dynamic.</Paragraph> <Section position="1" start_page="993" end_page="993" type="sub_section"> <SectionTitle> 2.1 Anomalous </SectionTitle> <Paragraph position="0"> Chat language is explicitly anomalous in two respects. First, some chat terms do not appear in standard dictionaries. For example, &quot;Jie Li (here, jie4 li3)&quot; is not a standard word in any contemporary Chinese dictionary, yet it is often used in place of &quot;Zhe Li (here, zhe4 li3)&quot; in chat language. Second, some chat terms do appear in standard dictionaries, but their meanings in chat language differ from the dictionary senses. For example, &quot;Ou (even, ou3)&quot; is often used in place of &quot;Wo (me, wo3)&quot; in chat text, whereas the standard dictionary entry for &quot;Ou&quot; describes even numbers. The latter case occurs frequently in chat text and makes chat text fairly ambiguous, because it is difficult to determine whether such a term is used as a standard word or as a chat term.</Paragraph> </Section> <Section position="2" start_page="993" end_page="994" type="sub_section"> <SectionTitle> 2.2 Dynamic </SectionTitle> <Paragraph position="0"> Chat text is considered dynamic because a large proportion of the chat terms used in one year may become obsolete in the next, while many new chat terms emerge. This feature is less explicit than the anomalous nature.</Paragraph> <Paragraph position="1"> It is, however, just as crucial. 
Observation of chat text in the NIL corpus reveals that the set of chat terms changes very quickly over time.</Paragraph> <Paragraph position="2"> An empirical study was conducted on five chat text collections extracted from the YESKY BBS system (bbs.yesky.com) in different time periods: Jan. 2004, July 2004, Jan. 2005, July 2005 and Jan. 2006. The chat terms in each collection were picked out by hand together with their frequencies, yielding five chat term sets. The 500 most frequent chat terms in each set were then selected to calculate the reoccurring rates of the earlier chat term sets in the later ones.</Paragraph> <Paragraph position="3"> In Table 1, the rows represent the earlier chat term sets and the columns the later ones.</Paragraph> <Paragraph position="4"> The surprising finding in Table 1 is that 29.4% of chat terms are replaced with new ones within two years, and about 18.5% within one year. This rate of change is much faster than that of standard language, which shows that chat text is indeed dynamic. The dynamic nature quickly renders a static corpus outdated and poses a challenge for chat language processing.</Paragraph> </Section> </Section> </Paper>