<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3022">
<Title>Chinese Word Segmentation in FTRD Beijing</Title>
<Section position="4" start_page="150" end_page="152" type="evalu">
<SectionTitle> 3 Evaluation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="150" end_page="150" type="sub_section">
<SectionTitle> 3.1 Open tracks </SectionTitle>
<Paragraph position="0"> In the open tracks, we used four lexicons of 210,319, 165,103, 174,268, and 165,655 entries on the AS-open, HK-open, MSR-open, and PK-open tracks, respectively. These lexicons include 2,430 MDWs, 12,487 PNs, 22,907 LNs, 29,032 ONs, and 10,414 four-character idioms, plus the word lists generated from the training data provided by the bakeoff.</Paragraph>
<Paragraph position="1"> We used the training data provided by the bakeoff to train our trigram word-based language model. We also used a family name list (399 entries in our system) and a 1,021-entry transliterated name character list.</Paragraph>
</Section>
<Section position="2" start_page="150" end_page="150" type="sub_section">
<SectionTitle> 3.2 Closed tracks </SectionTitle>
<Paragraph position="0"> In the closed tracks, the lexicon could only be generated from the training data provided by the bakeoff, and only that data could be used to train our word-based language model. Also, since all of our training data came from the bakeoff, no conflicting segmentation standards are involved, so the standards adaptation component is not needed.</Paragraph>
</Section>
<Section position="3" start_page="150" end_page="152" type="sub_section">
<SectionTitle> 3.3 Result analysis </SectionTitle>
<Paragraph position="0"> Our system is designed so that components such as factoid detection and NE identification can be switched on or off, allowing us to investigate the relative contribution of each component to overall word segmentation performance. The results are summarized in Table 1. For comparison, we also include in the table (Row 1) the results of using FMM. Row 2 shows the baseline results of our system, where only the lexicon is used. Each cell in the table has six fields: from top to bottom, Precision, Recall, F-measure, OOV Recall, IV Recall, and Speed (megabytes per second). We do not list the speed in Row 6, since applying thousands of TBL rules slows the system down by a factor of 10 to 60.</Paragraph>
<Paragraph position="1"> From Table 1 we find that, in Rows 1 and 2, the dictionary-based methods already achieve quite good recall, but their precision is not very good because they cannot correctly identify unknown words that are not in the lexicon, such as factoids and named entities. We also find that, even using the same lexicon, our approach based on N-gram language models outperforms the greedy approach, because the context model resolves more segmentation ambiguities. As shown in Rows 3 to 5, when components are switched on in turn, the overall word segmentation performance increases consistently. The morphological analysis makes no contribution to overall performance in Row 4. The main reason is that the number of MDWs used in our system is very small (only 2,430), and there may be very few MDWs in the test sets. A similar case occurs for NE identification in the closed tracks in Row 5, since we did not perform NE identification at all in the closed tracks.</Paragraph>
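To make the contrast between the greedy method (Row 1) and the language-model approach concrete, here is a minimal sketch of both strategies: forward maximum matching versus dynamic programming over lexicon-licensed segmentations scored by a language model. It is illustrative only, not the paper's implementation; the toy `LEXICON`, the unigram `lm_logprob` stand-in (the paper uses a trigram word model), and the fixed OOV penalty are all assumptions.

```python
# Illustrative sketch: greedy FMM vs. LM-scored DP segmentation.
# LEXICON, lm_logprob, and the OOV penalty are hypothetical placeholders,
# not the paper's actual resources or model.
import math

LEXICON = {"中国", "人民", "中", "国", "人", "民"}  # toy dictionary
MAX_WORD_LEN = 4

def fmm(sentence):
    """Greedy forward maximum matching: always take the longest lexicon match."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
            if sentence[i:j] in LEXICON or j == i + 1:  # fall back to one char
                words.append(sentence[i:j])
                i = j
                break
    return words

def lm_logprob(word):
    """Stand-in unigram score; a real system would use a trigram word model."""
    return math.log(1.0 / len(LEXICON)) if word in LEXICON else -20.0  # OOV penalty

def lm_segment(sentence):
    """DP over all candidate segmentations, keeping the best-scoring path."""
    n = len(sentence)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer) per position
    best[0] = (0.0, -1)
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_WORD_LEN), j):
            score = best[i][0] + lm_logprob(sentence[i:j])
            if score > best[j][0]:
                best[j] = (score, i)
    words, j = [], n                      # recover the best path backwards
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return list(reversed(words))
```

With a real trigram model, the DP can prefer a context-appropriate segmentation where FMM has already committed to the longest dictionary match, which is the ambiguity-resolution effect described above.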
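For reference, here is a minimal sketch of how the five accuracy fields in each cell of Table 1 (Precision, Recall, F-measure, OOV Recall, IV Recall) can be computed by matching word spans against a gold segmentation; Speed is simply measured throughput. This is a common scoring scheme, not necessarily the bakeoff's official scorer, and the `lexicon` argument used to split gold words into OOV and IV is an assumption.

```python
# Illustrative scorer sketch (hypothetical, not the official bakeoff scorer).
# Words are compared as (start, end) character-offset spans, so a system word
# counts as correct only if it matches a gold word exactly.

def spans(words):
    """Convert a word sequence to character-offset spans."""
    out, i = [], 0
    for w in words:
        out.append((i, i + len(w)))
        i += len(w)
    return out

def evaluate(gold_words, sys_words, lexicon):
    gold, sys = set(spans(gold_words)), set(spans(sys_words))
    correct = gold & sys
    precision = len(correct) / len(sys)
    recall = len(correct) / len(gold)
    f_measure = 2 * precision * recall / (precision + recall)
    # OOV/IV recall: recall restricted to gold words outside/inside the lexicon.
    gold_items = list(zip(spans(gold_words), gold_words))
    oov = [s for s, w in gold_items if w not in lexicon]
    iv = [s for s, w in gold_items if w in lexicon]
    oov_recall = sum(s in sys for s in oov) / len(oov) if oov else 1.0
    iv_recall = sum(s in sys for s in iv) / len(iv) if iv else 1.0
    return precision, recall, f_measure, oov_recall, iv_recall
```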
<Paragraph position="2"> We also notice that the contribution of NE identification is very small in the open tracks, which shows that the performance of NE identification in our system is not very good, and explains why our OOV recall is not very high compared with other participants in the bakeoff. Improving it is one area of our future work. The results of standards adaptation on the four bakeoff test sets are shown in Row 6. It turns out that performance improves slightly across the board on all four test sets, except for IV recall. The improvement is only slight mainly because the training data and lexicon we used come from the four bakeoff providers themselves, so there are hardly any differing segmentation standards to adapt to.</Paragraph>
</Section>
</Section>
</Paper>