<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1727">
  <Title>Chinese Word Segmentation in MSR-NLP</Title>
  <Section position="3" start_page="3" end_page="4" type="ackno">
    <SectionTitle>
2 Evaluation
</SectionTitle>
    <Paragraph position="0"> We participated in the four GB tracks in the first international Chinese word segmentation bakeoff -PK-open, PK-closed, CTB-open and CTB-closed and ranked #1, #2, #2, and #3 respectively in those tracks. In what follows, we discuss how we got the results: what dictionaries we used, how we used the training data, how much each component contributed to the scores, and the problems that affected our performance.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.1 Dictionaries
</SectionTitle>
      <Paragraph position="0"> In the open tracks, we used our proprietary dictionary of 89,845 entries, which includes the entries of 7,017 single characters. In the closed tracks, we removed from the dictionary all the words that did not appear in the training data, but kept all the single characters. This resulted in a dictionary of 34,681 entries in the PK track and 18,207 entries in the CTB track. It should be noted that not all the words in the training data are in our dictionary. This explains why the total numbers of entries in those reduced dictionaries are smaller than the vocabulary sizes of the respective training sets even with all the single-character entries included in them.</Paragraph>
      <Paragraph position="1"> The dictionary we use in each case is not a simple word list. Every word has one or more parts-of-speech and a number of other grammatical features. No word can be used by the parser unless it has those features. This made it very difficult for us to add all the words in the training data to the dictionary. We did use a semi-automatic process to add as many words as possible, but both the accuracy and coverage of the added grammatical features are questionable due to the lack of manual verification.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
2.2 Use of the training data
</SectionTitle>
      <Paragraph position="0"> We used the training data mainly to tune the segmentation parameters of our system. As has been mentioned in 1.6, there are about 50 types of morphologically derived words that are built online in our system and each type has a parameter to determine whether a given unit should be displayed as a single word or separate words. Since our default segmentation is very different from PK or CTB, and PK and CTB also follow different guidelines, we had to try different value combinations of the parameters in each case until we got the optimal settings.</Paragraph>
      <Paragraph position="1"> The main problem in the tuning is that many morphologically derived words have been lexicalized in our dictionary and therefore do not have the word-internal structures that they would have if they had been constructed dynamically. As a result, their segmentation is beyond the control of those parameters. To solve this problem, we used the training data to automatically identify all such cases, create a word-internal structure for each of them, and store the word tree in their lexical entries. null  This made it possible for the parameter values to apply to both the lexicalized and non-lexicalized words. This process can be fairly automatic if the annotation of the training data is completely consistent. However, as we have discovered, the training data is not as consistent as expected, which made total automation impossible.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.3 Contribution of each component
</SectionTitle>
      <Paragraph position="0"> After we received our individual scores and the reference testing data, we did some ablation ex- null The work is incomplete, since the trees were created only for those words that are in the training data provided. periments to find out the contribution of each system component in this competition. We turned off the components one at a time (except basic segmentation) and recorded the scores of each ablated system. The results are summarized in the following table, where &amp;quot;DM-NER&amp;quot; stands for &amp;quot;derivational morphology and named entity recognition&amp;quot;, &amp;quot;NW-ID&amp;quot; for &amp;quot;new word identification and lexicalization&amp;quot;, &amp;quot;pruning&amp;quot; for &amp;quot;lattice pruning&amp;quot; and &amp;quot;tuning&amp;quot; for &amp;quot;tuning of parameter values&amp;quot;. Each cell in the table has two percentages. The top one is the F-measure and the bottom one is the OOV word recall rate.</Paragraph>
      <Paragraph position="1">  gest impact on the scores across the board.</Paragraph>
      <Paragraph position="2"> * Derivational morphology and NE recognition is also a main contributor, especially in the PK sets, which presumably contains more named entities.</Paragraph>
      <Paragraph position="3"> * The impact of new word identification is minimal when the OOV word rate is low, such as in the PK-open case, but becomes more and more significant as the OOV rate increases.</Paragraph>
      <Paragraph position="4"> * Lattice pruning makes a big difference as well. Apparently it cannot be replaced by the parser in terms of the disambiguating function it performs. Another fact, which is not represented in the table, is that parsing is three times slower when lattice pruning is turned off.</Paragraph>
      <Paragraph position="5"> * The parser has very limited impact on the scores. Looking at the data, we find that parsing did help to resolve some of the most difficult cases of ambiguities and we would not be able to get the last few points without it. But it seems that most of the common problems can be solved without the parser. In one case (CTB closed), the score is higher when the parser is turned off.</Paragraph>
      <Paragraph position="6"> This is because the parser may prefer a structure where those dynamically recognized OOV words are broken up into smaller units. For practical purposes, therefore, we may choose to leave out the parser. 2.4 Problems that affected our performance The main problem is the definition of new words. While our system is fairly aggressive in recognizing new words, both PK and CTB are quite conservative in this respect. Expressions such as &amp;quot;Yuan Cang &amp;quot;, &amp;quot;Fan Fu &amp;quot;, &amp;quot;Er Bi Hou &amp;quot;, &amp;quot;Lu Jin Bu Jue &amp;quot; are considered single words in our system but not so in PK or CTB. This made our new word recognition do more harm than good in many cases, though the overall impact is positive. Consistency in the annotated corpora is another problem, but this affects every participant. We also had a technical problem where some sentences remained unsegmented simply because some characters are not in our dictionary. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML