<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3027"> <Title>A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005</Title> <Section position="4" start_page="168" end_page="170" type="metho"> <SectionTitle> 3 Feature engineering </SectionTitle> <Section position="1" start_page="168" end_page="168" type="sub_section"> <SectionTitle> 3.1 Features </SectionTitle> <Paragraph position="0"> The linguistic features used in our model fall into three categories: character identity n-grams, morphological features, and character reduplication features.</Paragraph> <Paragraph position="1"> For each state, the character identity features (Ng and Low, 2004; Xue and Shen, 2003; Goh et al., 2003) are represented using feature functions that key off of the identity of the character in the current, preceding, and subsequent positions. Specifically, we used four types of unigram feature functions, designated as C0 (current character), C1 (next character), C-1 (previous character), and C-2 (the character two positions back). Furthermore, five types of bigram features were used, designated here as conjunctions of the previously specified unigram features: C0C1, C-1C0, C-1C1, C-2C-1, and C-2C0.</Paragraph> <Paragraph position="2"> Given that unknown words are normally more than one character long, the morphological feature functions key off morphological information extracted from both the preceding state and the current state. Our morphological features are based on the intuition about unknown word features in Gao et al. (2004), whose idea was to use productive affixes and characters that only occur independently to predict the boundaries of unknown words. To construct a table of affixes of unknown words, rather than using threshold-filtered affix tables in a separate unknown word model as was done in Gao et al. (2004), we first extracted rare words from each corpus and then collected their first and last characters to build prefix and suffix tables. For the individual character word table, we collected, for each corpus, the characters that always occurred alone as a separate word in that corpus. We also collected a list of bigrams from each training corpus to distinguish known strings from unknown ones. Using all of these features together in one model, with automatically generated morphological tables, kept us from manually overfitting the system to the Mandarin varieties we are most familiar with.</Paragraph>
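To make the table construction concrete, the following is a minimal sketch of how the prefix, suffix, individual character word, and bigram tables might be built from a word-segmented training corpus. The function name, data layout, and rare-word cutoff are illustrative assumptions; the paper does not specify a rarity threshold.

```python
from collections import Counter

def build_morph_tables(corpus, rare_threshold=5):
    """Build the morphological tables of Section 3.1 from a word-segmented
    corpus, given as a list of sentences, each a list of words.
    The rarity cutoff is an illustrative assumption."""
    word_counts = Counter(w for sent in corpus for w in sent)

    # Rare multi-character words stand in for unknown words; their first
    # and last characters populate the prefix and suffix tables.
    prefixes, suffixes = set(), set()
    for word, count in word_counts.items():
        if len(word) > 1 and count < rare_threshold:
            prefixes.add(word[0])
            suffixes.add(word[-1])

    # Characters that always occur alone as a separate word: observed as
    # single-character words and never inside a longer word.
    single = {w for w in word_counts if len(w) == 1}
    inside = {c for w in word_counts if len(w) > 1 for c in w}
    individual = single - inside

    # Character bigrams observed in the training text, used to
    # distinguish known strings from unknown ones.
    bigrams = set()
    for sent in corpus:
        chars = "".join(sent)
        bigrams.update(chars[i:i + 2] for i in range(len(chars) - 1))

    return prefixes, suffixes, individual, bigrams
```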
<Paragraph position="3"> The tables are used in the following ways: 1) C-1+C0 unknown word feature functions were created for each specific character pair in the bigram table. Such a feature function is active if the characters in the respective states match its characters; these feature functions are designed to distinguish known strings from unknown ones. 2) C-1, C0, and C1 individual character feature functions were created for each character in the individual character word table, and are likewise active if the respective character matches the feature function's character. 3) C-1 prefix feature functions are defined over the characters in the prefix table, and fire if the character in the preceding state matches the feature function's character. 4) C0 suffix feature functions are defined over the characters in the suffix table, and fire if the character in the current state matches the feature function's character.</Paragraph> <Paragraph position="4"> Additionally, we use reduplication feature functions that are active based on the repetition of a given character. We used two such feature functions: one fires if the previous and current characters (C-1 and C0) are identical, and one fires if the previous and subsequent characters (C-1 and C1) are identical.</Paragraph>
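These templates translate directly into binary feature functions over a state and its predecessor. The sketch below, which reuses the illustrative build_morph_tables helper above, shows one plausible encoding of points 1) through 4) and of the reduplication features; the feature-name strings are assumptions for exposition, and a CRF toolkit would learn a lambda weight for each generated feature.

```python
def morph_features(chars, i, tables):
    """Generate the unknown word and reduplication features of Section 3.1
    for position i in the character sequence chars; tables is the output
    of build_morph_tables. Feature-name strings are illustrative."""
    prefixes, suffixes, individual, bigrams = tables
    prev_c = chars[i - 1] if i > 0 else "<S>"
    cur_c = chars[i]
    next_c = chars[i + 1] if i + 1 < len(chars) else "</S>"

    feats = []
    # 1) C-1+C0 feature: active when the character pair is a known bigram.
    if prev_c + cur_c in bigrams:
        feats.append("KNOWN_BIGRAM=" + prev_c + cur_c)
    # 2) Individual character word features at C-1, C0, and C1.
    for name, c in (("IND-1=", prev_c), ("IND0=", cur_c), ("IND+1=", next_c)):
        if c in individual:
            feats.append(name + c)
    # 3) Prefix feature: fires on the character in the preceding state.
    if prev_c in prefixes:
        feats.append("PREFIX-1=" + prev_c)
    # 4) Suffix feature: fires on the character in the current state.
    if cur_c in suffixes:
        feats.append("SUFFIX0=" + cur_c)
    # Reduplication features: C-1 == C0 and C-1 == C1.
    if prev_c == cur_c:
        feats.append("REDUP(-1,0)")
    if prev_c == next_c:
        feats.append("REDUP(-1,+1)")
    return feats
```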
<Paragraph position="5"> Most features appeared in first-order templates, with a few of the character identity features appearing in both zero-order and first-order templates.</Paragraph> <Paragraph position="6"> We also normalized punctuation, since Mandarin uses a large variety of punctuation marks.</Paragraph> <Paragraph position="7"> Table 1 shows the number of features and lambda weights for each corpus.</Paragraph> </Section> <Section position="2" start_page="168" end_page="169" type="sub_section"> <SectionTitle> 3.2 Experiments </SectionTitle> <Paragraph position="0"> Experiments done while developing this system showed that its performance was significantly better than that of Peng et al. (2004). As seen in Table 2, our system achieved an F-score of 0.863 on CTB (the Chinese Treebank from the University of Pennsylvania) versus 0.849 F for Peng et al. (2004). We do not at present have a good understanding of which aspects of our system account for its superior performance.</Paragraph> <Paragraph position="1"> Our final system achieved an F-score of 0.947 (AS), 0.943 (HK), 0.950 (PK), and 0.964 (MSR). This shows that our system generalized successfully and achieved state-of-the-art performance on all four corpora.</Paragraph> <Paragraph position="2"> Table 3: Performance of the features cumulatively, starting with the n-gram features.</Paragraph> <Paragraph position="3"> Table 3 lists our results on the four corpora. We give results using just the character identity features, and then the character identity features plus the unknown word and reduplication features. Our unknown word features helped only on AS and MSR. Both of these corpora contain longer words than HK and PK, which indicates that the unknown word features were more useful for corpora whose segmentation standards tend to produce longer words. On the HK corpus, performance dropped when we added the unknown word features. However, we found that the test data uses different punctuation than the training set, and our system could not distinguish new word characters from new punctuation, since having a complete punctuation list is considered external knowledge for closed-track systems. Had the new punctuation not been unknown to us, our performance on the HK data would have risen to 0.952 F, and the unknown word features would not have hurt the system much.</Paragraph> <Paragraph position="4"> Table 4 presents recall (R), precision (P), F-score (F), and recall on unknown words (Roov) and known words (Riv).</Paragraph> </Section> <Section position="3" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 3.3 Error analysis </SectionTitle> <Paragraph position="0"> Our system performed reasonably well on morphologically complex new words, such as the AS word glossed CABLE and the PK word glossed MURDER CASE, whose final characters, glossed LINE and CASE respectively, are productive suffixes. However, it overgeneralized on words with frequent suffixes, incorrectly joining, for example, the PK phrases glossed "to burn someone" and "to look backward" into single words. For the corpora that treat four-character idioms as single words, our system joined most novel idioms correctly. This differs greatly from the results one would likely obtain with a more traditional MaxMatch-based technique, since such an algorithm would split novel idioms apart.</Paragraph> <Paragraph position="1"> One shortcoming of our system is that it is not robust enough to distinguish ordinal numbers from numbers with measure nouns. For example, the expressions glossed "3rd year" and "three years" are indistinguishable to our system. Avoiding this problem would likely require more syntactic knowledge than is implicitly present in the training data. Finally, some errors are due to inconsistencies in the gold segmentation of non-hanzi characters. For example, "Pentium4" is treated as one word, but "PC133" as two, and the same alphanumeric string is sometimes one word and sometimes segmented into two words.</Paragraph> </Section> </Section> <Section position="5" start_page="170" end_page="170" type="metho"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> Our system used a conditional random field sequence model in conjunction with character identity features, morphological features, and character reduplication features. We extracted our morphological information automatically so as not to overfit to the Mandarin of any particular Mandarin-speaking area. Our final system achieved an F-score of 0.947 (AS), 0.943 (HK), 0.950 (PK), and 0.964 (MSR).</Paragraph> </Section> </Paper>