File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-3013_metho.xml
Size: 16,382 bytes
Last Modified: 2025-10-06 14:09:34
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3013"> <Title>NIL Is Not Nothing: Recognition of Chinese Network Informal Language Expressions</Title> <Section position="3" start_page="95" end_page="96" type="metho"> <SectionTitle> 2 The Ways NIL Expressions Are Typically Formed </SectionTitle> <Paragraph position="0"> NIL expressions were first introduced to speed up writing or computer input, especially in online chat, where input speed is crucial to prompt and effective communication. For example, it is tedious to input full Chinese sentences in a text-based chat environment, e.g. on a mobile phone. Abbreviations and acronyms are therefore created by writing, in capitals, the first letters of a series of either English words or Chinese Pinyin syllables.</Paragraph> <Paragraph position="1"> Chinese Pinyin is a popular approach to Chinese character input. Some Pinyin input methods incorporate lexical intelligence to support word or phrase input, which greatly improves the input rate.</Paragraph> <Paragraph position="2"> However, Pinyin input is not error free. Firstly, the input method usually prompts the user with several candidates, and selection errors result in homophones, e.g. &quot;ban1 zu2&quot; and &quot;ban1 zhu3&quot;. Secondly, input with incorrect Pinyin or in dialect produces wrong Chinese words with similar pronunciation, e.g. &quot;稀饭 (xi1 fan4)&quot; and &quot;喜欢 (xi3 huan1)&quot;. The pace of chat, however, leaves the user little time to correct such mistakes. The same mistake is constantly repeated in text, and the wrong word thus becomes accepted by the chat community. This, in fact, is one common way in which a new Chinese NIL expression is created.</Paragraph> <Paragraph position="3"> We collect a large number of &quot;sentences&quot; (strictly speaking, not all of them are sentences) from a Chinese BBS system and identify NIL expressions by hand. 
An empirical study of the NIL expressions in this collection shows that they can be classified into four classes, as follows, based on their origins.</Paragraph> <Paragraph position="4"> 1) Abbreviation (A). Many Chinese NIL expressions are derived from abbreviations of Chinese Pinyin. For example, &quot;PF&quot; stands for &quot;佩服 (pei4 fu2)&quot;, which means &quot;admire&quot;.</Paragraph> <Paragraph position="5"> 2) Foreign expression (F). Popular informal expressions from foreign languages such as English are adopted, e.g. &quot;ASAP&quot; is used for &quot;as soon as possible&quot;.</Paragraph> <Paragraph position="6"> 3) Homophone (H). A NIL expression is sometimes generated by borrowing a word with a similar sound (i.e. similar Pinyin). For example, &quot;稀饭&quot; stands for &quot;喜欢&quot;, which means &quot;like&quot;. &quot;稀饭&quot; and &quot;喜欢&quot; are homophonous in a Chinese dialect.</Paragraph> <Paragraph position="7"> 4) Transliteration (T). Transliteration is a transcription from one alphabet to another, in which a letter-for-letter or sound-for-letter spelling is used to represent a word from another language. For example, &quot;拜拜 (bai4 bai4)&quot; is a transliteration of &quot;bye-bye&quot;.</Paragraph> <Paragraph position="8"> Closer observation reveals that, depending on the way it is formed and/or its part-of-speech (POS) attributes, a NIL expression usually takes one of the forms presented in Table 1 and Table 2.</Paragraph> <Paragraph position="9"> The above empirical study is essential to NIL lexicography and feature definition.</Paragraph> </Section> <Section position="4" start_page="96" end_page="97" type="metho"> <SectionTitle> 3 Related Works </SectionTitle> <Paragraph position="0"> NIL expression recognition can be considered a subtask of information extraction (IE). Named entity recognition (NER) has an objective similar to that of NIL expression recognition, i.e. 
to extract meaningful text segments from unstructured text according to certain pre-defined criteria.</Paragraph> <Paragraph position="1"> NER is a key technology for NLP applications such as IE and question answering. It typically aims to recognize names of persons, organizations and locations, and expressions of number, time and currency. The objective is achieved by employing either handcrafted knowledge or supervised learning techniques. The latter currently dominate NER; among them the most popular methods are decision trees (Sekine et al., 1998; Paliouras et al., 2000), hidden Markov models (Zhang et al., 2003; Zhao, 2004), maximum entropy (Chieu and Ng, 2002; Bender et al., 2003), and support vector machines (Isozaki and Kazawa, 2002; Takeuchi and Collier, 2002; Mayfield, 2003).</Paragraph> <Paragraph position="2"> From the linguistic perspective, NIL expressions differ from named entities in nature. Firstly, a named entity is typically a noun or noun phrase (NP), but a NIL expression can be of any kind, e.g. the number &quot;94&quot; in NIL represents &quot;就是&quot;, which is a verb meaning &quot;exactly be&quot;. Secondly, named entities often have well-defined meanings in text and are tractable from a standard dictionary, whereas NIL expressions are either unknown to the dictionary or ambiguous. For example, &quot;稀饭&quot; appears in conventional dictionaries with the meaning of Chinese porridge, but in NIL text it represents &quot;喜欢&quot;, which means &quot;like&quot;. The issue that concerns us is that expressions like &quot;稀饭&quot; may also appear in NIL text with their formal meaning. 
This leads to ambiguity and makes NIL processing more difficult.</Paragraph> <Paragraph position="3"> Another notable work is the &quot;Normalization of Non-standard Words&quot; project (Sproat et al., 2001), which aims to detect and normalize &quot;Non-Standard Words (NSW)&quot; such as digit sequences, capitalized words or letter sequences, mixed-case words, abbreviations, Roman numerals, URLs and e-mail addresses. In our work, we consider most types of English NSW except URLs and e-mail addresses. Moreover, we consider Chinese NIL expressions that consist of the same characters as normal words. For example, &quot;稀饭&quot; and &quot;:E,Q&quot; both appear in common dictionaries, but they carry anomalous meanings in NIL text. The resulting ambiguity takes NIL expression recognition beyond the scope of NSW detection.</Paragraph> <Paragraph position="4"> Based on the above observations, we propose to employ existing IE techniques to handle NIL expressions. Our goal is to develop a NIL expression recognition system to facilitate network-mediated communication. For this purpose, we first construct the required NIL knowledge resources, namely a NIL dictionary and n-gram statistical features.</Paragraph> </Section> <Section position="5" start_page="97" end_page="98" type="metho"> <SectionTitle> 4 Knowledge Engineering </SectionTitle> <Paragraph position="0"> Recognition of NIL expressions relies on unconventional linguistic knowledge such as a NIL dictionary and NIL features. We construct a NIL corpus and develop a knowledge engineering component that obtains this knowledge by running a knowledge mining tool on the NIL corpus. The knowledge mining tool is a text processing program that extracts NIL expressions, their attributes and their contextual information, i.e. n-grams, from the NIL corpus. 
Workflow for this component is presented in Figure 1.</Paragraph> <Section position="1" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 4.1 NIL Corpus </SectionTitle> <Paragraph position="0"> The NIL corpus is a collection of network informal sentences which provides training data for the NIL dictionary and the statistical NIL features. It is constructed by manually annotating a collection of NIL text.</Paragraph> <Paragraph position="1"> Obtaining real chat text is difficult because of privacy restrictions. Fortunately, we find that BBS text within the &quot;(da4 zui3 qu1)&quot; zone of the YESKY system (http://bbs.yesky.com/bbs/) shows remarkably colloquial characteristics and contains a vast number of NIL expressions. We download BBS text posted in this zone between December 2004 and February 2005. Sentences with NIL expressions are selected by human annotators, and the NIL expressions are manually identified and annotated with their attributes. We finally collected 22,432 sentences including 451,193 words and 22,648 NIL expressions.</Paragraph> <Paragraph position="2"> The NIL expressions are marked up with SGML. The typical example, i.e. &quot;4?4?U &quot; in Section 1, is annotated as follows.</Paragraph> <Paragraph position="3"> where NILEX is the SGML tag that labels a NIL expression and carries the NIL linguistic attributes class, normal, pinyin, segments, pos, and posseg (see Section 4.2). H is a value of class (see Section 2). The value VERB denotes verb, ADJ adjective, NUM number and AUX auxiliary.</Paragraph> </Section> <Section position="2" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 4.2 NIL Dictionary </SectionTitle> <Paragraph position="0"> The NIL dictionary is a structured databank that contains NIL expression entries. Each entry in turn has nine attributes, described as follows.</Paragraph> <Paragraph position="1"> 1. ID: a unique identification number for the NIL expression, e.g. 915800; 2. 
string: string of the NIL expression, e.g. &quot;4 ?4 &quot;; 3. class: class of the NIL expression (see Section 2), e.g. &quot;H&quot; for homophone; 4. pinyin: Chinese Pinyin for the NIL expression, e.g. &quot;xi4 ba1 xi4&quot;; 5. normal: corresponding normal text for the NIL expression, e.g. &quot; &quot;; 6. segments: word segments of the NIL expression, e.g. &quot;4 |? |4 &quot;; 7. pos: POS tag associated with the expression, e.g. &quot;VERB&quot; denoting a verb; 8. posseg: a POS tag list for the word segments, e.g. &quot;VERB|AUX|VERB&quot;; 9. frequency: number of occurrences of the NIL expression.</Paragraph> <Paragraph position="2"> We run the knowledge mining tool to extract all annotated NIL expressions together with their attributes from the NIL corpus. Each NIL expression is then assigned an ID number and inserted into an indexed data file, i.e. the NIL dictionary. The current NIL dictionary contains 651 NIL entries.</Paragraph> </Section> <Section position="3" start_page="97" end_page="98" type="sub_section"> <SectionTitle> 4.3 NIL Feature Set </SectionTitle> <Paragraph position="0"> The NIL features are required by the support vector machines method in NIL expression recognition.</Paragraph> <Paragraph position="1"> We define two types of statistical features for NIL expressions, i.e. Chinese word n-grams and POS tag n-grams. A larger n provides more contextual information but results in higher computational complexity. As a compromise, we generate n-grams with n = 1, 2, 3, 4. For example, &quot; /4?4 &quot; is a bi-gram for &quot;4?4 &quot; in terms of word segmentation, and its POS tag bi-gram is &quot;PRONOUN/VERB&quot;.</Paragraph> <Paragraph position="4"> We run the knowledge mining tool on the NIL corpus to produce all n-grams of Chinese words and of their POS tags in which a NIL expression appears. In total, 8,379 features were generated, including 7,416 word-based n-grams and 963 POS tag-based n-grams. 
These statistical NIL features are linked to the corresponding NIL dictionary entries by their global NIL expression IDs.</Paragraph> <Paragraph position="5"> In addition, we consider some morphological features, including whether the expression is or contains a number, English capital letters or Chinese characters. These features can be extracted by parsing the string of the NIL expression.</Paragraph> </Section> </Section> <Section position="6" start_page="98" end_page="99" type="metho"> <SectionTitle> 5 NILER System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 5.1 Architecture </SectionTitle> <Paragraph position="0"> We develop the NILER system to recognize NIL expressions in NIL text and convert them to normal language text. The latter functionality is discussed elsewhere. The architecture of the NILER system is presented in Figure 2.</Paragraph> <Paragraph position="1"> The input chat text is first segmented and POS tagged with the ICTCLAS tool. Because ICTCLAS is not able to identify NIL expressions, some expressions are broken into several segments. The NIL expression recognizer processes the segments and POS tags and identifies the NIL expressions.</Paragraph> </Section> <Section position="2" start_page="98" end_page="99" type="sub_section"> <SectionTitle> 5.2 NIL Expression Recognizer </SectionTitle> <Paragraph position="0"> We implement two methods for NIL expression recognition, i.e. pattern matching and support vector machines.</Paragraph> <Paragraph position="1"> Pattern matching (PM) is a traditional method in information extraction systems. It uses a hand-crafted rule set and a dictionary. Because it is simple, fast and independent of corpus, this method is widely used in IE tasks. Using the NIL dictionary, candidate NIL expressions are first extracted from the input text with longest matching. As ambiguity occurs constantly, 24 patterns are produced and employed for disambiguation. 
We first extract the word and POS tag n-grams from the NIL corpus and create patterns by generalizing them manually. An illustrative pattern is presented as follows.</Paragraph> <Paragraph position="2"> $[v_{any}]\ 8\ \mathrm{not}(v_{unit})\ [v_{any}]$ where $v_{any}$ and $v_{unit}$ are variables denoting any word and any unit word respectively, and $\mathrm{not}(x)$ is the negation operator. The illustrative pattern determines &quot;8&quot; to be a NIL expression if it is not succeeded by a unit word. With this pattern, &quot;8&quot; within a sentence meaning &quot;He has been working for eight hours.&quot; is not recognized as a NIL expression.</Paragraph> <Paragraph position="3"> The support vector machines (SVM) method achieves high performance in many classification tasks (Joachims, 1998; Kudo and Matsumoto, 2001). As SVM can handle large numbers of features efficiently, we apply the SVM classification method to NIL expression recognition.</Paragraph> <Paragraph position="4"> Suppose we have a set of training data for a two-class classification problem $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in R^D$ $(i = 1, 2, \ldots, N)$ is the feature vector of the i-th sample in the training set and $y_i \in \{+1, -1\}$ is the label of the sample.</Paragraph> <Paragraph position="5"> The goal of SVM is to find a decision function that accurately predicts y for unseen x. A non-linear SVM classifier gives a decision function $f(x) = \mathrm{sign}(g(x))$ for an input vector x, where $g(x) = \sum_{i=1}^{l} \omega_i K(x, z_i) + b$.</Paragraph> <Paragraph position="7"> The $z_i$ are so-called support vectors, and are representatives of the training samples. $\omega_i$ and b are parameters of the SVM model. l is the number of training samples. $K(x, z)$ is a kernel function that implicitly maps vector x into a higher dimensional space. A typical kernel is defined in terms of dot products, i.e. $K(x, z) = k(x \cdot z)$.</Paragraph> <Paragraph position="8"> Based on the training process, the SVM algorithm constructs the support vectors and parameters. When text is input for classification, it is first converted into a feature vector x. 
The SVM method then classifies the vector x by determining the sign of g(x): $f(x) = +1$ means that x is a positive sample, and $f(x) = -1$ means that it is a negative one. The SVM algorithm was later extended in SVMmulticlass to predict multivariate outputs (Joachims, 1998).</Paragraph> <Paragraph position="9"> In NIL expression recognition, we use the NIL corpus as the training set and the annotated NIL expressions as samples. NIL expression recognition is cast as a five-class SVM classification task, in which four classes are those defined in Section 2 and reflected by the class attribute within the NIL annotation scheme. The fifth class is NOCLASS, which means the input text does not belong to any NIL expression class.</Paragraph> </Section> </Section> </Paper>