<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3028">
  <Title>Chinese Word Segmentation with Multiple Postprocessors in HIT-IRLab</Title>
  <Section position="3" start_page="0" end_page="173" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Basic Segmentation
</SectionTitle>
      <Paragraph position="0"> When a line is input into the system, it is first split into sentences at periods. The reason for splitting a line into sentences is that, in named entity recognition, processing several shorter sentences yields a higher named entity recall rate than processing one long sentence.</Paragraph>
      <Paragraph position="1"> The reason for splitting the line only at periods is simplicity of implementation, and the sentences delimited by periods are short enough to process.</Paragraph>
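A minimal sketch of this splitting step, assuming the delimiter is the Chinese full stop and ignoring other sentence-ending punctuation (the function name and details are illustrative, not the authors' code):

    def split_into_sentences(line, delimiter="\u3002"):
        """Split an input line into sentences on the (Chinese) period only."""
        sentences, current = [], []
        for ch in line:
            current.append(ch)
            if ch == delimiter:            # keep the period attached to its sentence
                sentences.append("".join(current))
                current = []
        if current:                        # trailing text without a final period
            sentences.append("".join(current))
        return sentences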
      <Paragraph position="2"> Then every sentence is segmented into single atoms. For example, a sentence like &amp;quot;HIT-IRLab tZ1`SIGHANAA#&amp;quot; will be segmented as &amp;quot;HIT-IRLab//t/Z/1/`/ /SIGHAN//A/A/#/&amp;quot;.</Paragraph>
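A possible sketch of the atom segmentation (a hypothetical helper: runs of ASCII letters, digits and joining symbols such as "HIT-IRLab" or "SIGHAN" stay whole, and every other non-space character becomes its own atom):

    import re

    # An atom is either a run of ASCII letters/digits (plus '-', '_', '.'),
    # so tokens like "HIT-IRLab" and "SIGHAN" stay whole, or a single
    # non-space character, so each Chinese character becomes one atom.
    ATOM_PATTERN = re.compile(r"[A-Za-z0-9_.\-]+|[^\sA-Za-z0-9_.\-]")

    def atom_segment(sentence):
        """Split one sentence into the atoms used to build the segment graph."""
        return ATOM_PATTERN.findall(sentence)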
      <Paragraph position="3"> After atom segmentation, a segment graph is created. The number of nodes in the graph is the number of atoms plus 1, and every atom corresponds to an arc in the graph.</Paragraph>
      <Paragraph position="4"> Then all the words in the dictionary2 that appear in the sentence are added to the segment graph. The graph carries various information, such as the bigram probability of every word. Figure 1 shows the segment graph of the above sentence after basic segmentation.</Paragraph>
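The segment graph construction might be sketched as follows (the lattice layout, the fallback probability and all names are assumptions for illustration; in the real system bigram probabilities are attached to the arcs as described above):

    from collections import defaultdict

    def build_segment_graph(atoms, dictionary, word_prob):
        """Build a word lattice over one sentence.

        atoms      : atoms produced by atom segmentation
        dictionary : set of known words
        word_prob  : word -> probability (stands in for the bigram model)
        Returns start_node -> list of (end_node, word, prob); the graph has
        len(atoms) + 1 nodes, numbered 0 .. len(atoms).
        """
        graph = defaultdict(list)
        n = len(atoms)

        # Every atom is an arc between adjacent nodes, so an all-atom path always exists.
        for i, atom in enumerate(atoms):
            graph[i].append((i + 1, atom, word_prob.get(atom, 1e-8)))

        # Add an arc for every multi-atom dictionary word found in the sentence.
        for i in range(n):
            for j in range(i + 2, n + 1):
                candidate = "".join(atoms[i:j])
                if candidate in dictionary:
                    graph[i].append((j, candidate, word_prob.get(candidate, 1e-8)))
        return graph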
    </Section>
    <Section position="2" start_page="0" end_page="172" type="sub_section">
      <SectionTitle>
2.2 Factoid Recognition
</SectionTitle>
      <Paragraph position="0"> After basic segmentation, a graph with all the atoms and all the words in the dictionary is set up. On this basis, we find all the factoids such as numbers, times and e-mails with a set of rules. Then we also add all these factoids to the segment graph.</Paragraph>
      <Paragraph position="1"> 2 The dictionary is trained with the training corpus. (Note to Figure 1: the probability of each word is not shown in the graph.)</Paragraph>
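A toy version of the rule-based factoid recognition could look like the sketch below (the three regular expressions are illustrative stand-ins for the system's rule set; the spans they return would then be added to the segment graph as extra arcs):

    import re

    # Toy factoid rules; the rule set in the actual system is richer.
    FACTOID_RULES = [
        ("email",  re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")),
        ("time",   re.compile(r"\d{1,2}:\d{2}(?::\d{2})?")),
        ("number", re.compile(r"\d+(?:\.\d+)?%?")),
    ]

    def find_factoids(sentence):
        """Return (start, end, label, text) spans for every factoid found."""
        spans = []
        for label, pattern in FACTOID_RULES:
            for match in pattern.finditer(sentence):
                spans.append((match.start(), match.end(), label, match.group()))
        return spans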
    </Section>
    <Section position="3" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
2.3 Named Entity Recognition
</SectionTitle>
      <Paragraph position="0"> Then we recognize named entities such as persons and locations. First, we select the N3 best paths from the segment graph with the Dijkstra algorithm. Then, for every one of the N+1 paths4 (the N best paths plus the atom path), we perform Roles Tagging with an HMM model (Zhang et al. 2003). This process is much like Part-of-Speech tagging. With the best role sequence of every path, we can then find all the named entities and add them to the segment graph as usual. Take the sentence &amp;quot; T#Q:*&amp;quot; for example. After basic segmentation and factoid recognition, the N+1 paths are as follows: //T#///Q/:*/ //T#///Q/:/*/ Then for each path Roles Tagging is performed and the following role sequences are generated:</Paragraph>
      <Paragraph position="2"> From these role sequences, we can see that the role pattern &amp;quot;XSW&amp;quot; marks a 3-character Chinese name. So the word &amp;quot;T#&amp;quot; is recognized as a person name and added to the segment graph.</Paragraph>
      <Paragraph position="3"> 3 N is a constant, set to 8 in our system. 4 The number of paths may be smaller than N+1 if the sentence is short enough; N+1 is the upper bound on the number of paths. 5 X, S, W, N and O are roles for person name recognition: X is the surname, S is the first character of the given name, W is the second character of the given name, N is the word following a person name, and O is other remote context. We defined 17 roles for person name recognition and 10 roles for location name recognition.</Paragraph>
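A much simplified sketch of the Roles Tagging step, under clearly stated assumptions: the HMM tables are passed in as plain dictionaries, decoding is ordinary Viterbi, and only the X-S-W person-name pattern from footnote 5 is matched (the real system defines 17 person-name and 10 location-name roles and applies this to each of the N+1 paths):

    import math

    def viterbi_roles(words, roles, trans, emit, start):
        """Tag one segmentation path with person-name roles via Viterbi.

        trans[prev][cur] = P(cur role | prev role)
        emit[role][word] = P(word | role)
        start[role]      = P(role at sentence start)
        The table layouts and smoothing constants are assumptions.
        """
        def lp(p):
            return math.log(p) if p > 0 else -1e9

        best = [{r: (lp(start.get(r, 0.0)) + lp(emit.get(r, {}).get(words[0], 1e-8)), [r])
                 for r in roles}]
        for word in words[1:]:
            layer = {}
            for r in roles:
                score, path = max(
                    (prev_score + lp(trans.get(pr, {}).get(r, 1e-8)), prev_path)
                    for pr, (prev_score, prev_path) in best[-1].items())
                layer[r] = (score + lp(emit.get(r, {}).get(word, 1e-8)), path + [r])
            best.append(layer)
        return max(best[-1].values())[1]          # best role sequence

    def extract_person_names(words, role_seq):
        """Pattern 'X S W' (surname + two given-name characters) -> person name."""
        names, i = [], 0
        while i < len(role_seq) - 2:
            if role_seq[i:i + 3] == ["X", "S", "W"]:
                names.append("".join(words[i:i + 3]))
                i += 3
            else:
                i += 1
        return names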
    </Section>
    <Section position="4" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
2.4 Merging of Adjoining Words
</SectionTitle>
      <Paragraph position="0"> After the steps above, the segment graph is complete and a best word sequence is generated with the Dijkstra algorithm. This merging operation and all the following operations are performed on the best word sequence.</Paragraph>
      <Paragraph position="1"> There are many inconsistencies in the PK corpus. For example, in the PK training corpus, the word &amp;quot;&amp;quot; is sometimes annotated as one word and sometimes as two separate words, &amp;quot;&amp;quot;. These inconsistencies lower the system's performance to some extent.</Paragraph>
      <Paragraph position="2"> To solve this problem, we first estimate from the training corpus the probability that a string is annotated as one word and the probability that it is annotated as two separate words. Then we perform a merging step: if two adjoining words in the best word sequence are more likely to form one word, we simply merge them.</Paragraph>
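A sketch of this merging postprocessor, assuming the two trained probabilities are available as simple lookup tables (table names and format are illustrative, not the authors' implementation):

    def merge_adjoining_words(words, p_joined, p_split):
        """Merge two adjoining words when the joined form is more probable.

        p_joined[w] : probability (trained from the corpus) that string w
                      is annotated as one word
        p_split[w]  : probability that w is annotated as two separate words
        Works left to right over the best word sequence.
        """
        merged = []
        i = 0
        while i < len(words):
            if i + 1 < len(words):
                joined = words[i] + words[i + 1]
                if p_joined.get(joined, 0.0) > p_split.get(joined, 0.0):
                    merged.append(joined)      # more often one word in the corpus
                    i += 2
                    continue
            merged.append(words[i])
            i += 1
        return merged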
    </Section>
    <Section position="5" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
2.5 Morphologically Derived Word Recognition
</SectionTitle>
      <Paragraph position="0"> To deal with words carrying a postfix such as &amp;quot;&amp;quot;, &amp;quot;5&amp;quot;, &amp;quot;)[&amp;quot; and so on, we merge the preceding word and the postfix into one word. We train a list of postfixes from the training corpus and then scan the best word sequence; if a single-character word appears in the postfix list, we merge the preceding word and this postfix into one word. For example, a best word sequence like &amp;quot;KSC5D=&amp;quot; will be converted to &amp;quot;KSC5D=&amp;quot; after this operation.</Paragraph>
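A sketch of this postfix-merging operation (the postfix list is assumed to be a set of single-character strings trained from the corpus; the function name is illustrative):

    def merge_postfixes(words, postfix_list):
        """Attach a single-character postfix to the word preceding it.

        postfix_list : set of single-character postfixes trained from the corpus
        """
        result = []
        for word in words:
            if result and len(word) == 1 and word in postfix_list:
                result[-1] = result[-1] + word     # merge postfix into previous word
            else:
                result.append(word)
        return result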
    </Section>
    <Section position="6" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
2.6 New Word Identification
</SectionTitle>
      <Paragraph position="0"> For words that are not in the dictionary and cannot be identified by the steps above, we perform New Word Identification (NWI). From the training corpus we estimate, for each word, the probability that it stands alone and the probability that it is a particular part of another word. In our system, we only consider words of one or two characters. We then scan the best word sequence; if the product of the probabilities of two adjoining words exceeds a threshold, we merge the two words into one word.</Paragraph>
      <Paragraph position="1"> Take the word &amp;quot;*;&amp;quot; for example. It is segmented as &amp;quot;*;&amp;quot; after all the above steps since this word is not in the dictionary. We find that &amp;quot;*&amp;quot; has a probability of 0.83 of being the first character of a two-character word, and &amp;quot;;&amp;quot; has a probability of 0.94 of being the last character of a two-character word. Their product is 0.78, which is larger than the threshold of 0.65 used in our system, so &amp;quot;*;&amp;quot; is recognized as a single word.</Paragraph>
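A sketch of New Word Identification under the same assumptions (probability tables passed in as dictionaries; 0.65 is the threshold stated in the text):

    def identify_new_words(words, p_first, p_last, threshold=0.65):
        """Merge two adjoining short words when the product of their
        positional probabilities exceeds the threshold.

        p_first[w] : probability that w is the first part of a two-character word
        p_last[w]  : probability that w is the last part of a two-character word
        Only words of one or two characters are considered, as in the system.
        """
        result = []
        i = 0
        while i < len(words):
            if i + 1 < len(words) and len(words[i]) <= 2 and len(words[i + 1]) <= 2:
                score = p_first.get(words[i], 0.0) * p_last.get(words[i + 1], 0.0)
                if score > threshold:              # e.g. 0.83 * 0.94 = 0.78 > 0.65
                    result.append(words[i] + words[i + 1])
                    i += 2
                    continue
            result.append(words[i])
            i += 1
        return result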
    </Section>
  </Section>
  <Section position="4" start_page="173" end_page="173" type="metho">
    <SectionTitle>
3 Tracks
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
3.1 Closed Track
</SectionTitle>
      <Paragraph position="0"> As for the PK closed track, we first extract all the common words and tokens from the training corpus and set up a dictionary of 55,335 entries.</Paragraph>
      <Paragraph position="1"> Then we extract each kind of named entity separately. With these named entities, we train the parameters for Roles Tagging. We also train all the other parameters mentioned in Section 2 on the training corpus.</Paragraph>
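One plausible way to extract such a dictionary from the gold-segmented training corpus is sketched below (the whitespace-separated corpus format, the UTF-8 encoding and the frequency cut-off are assumptions; the paper does not describe the extraction procedure in detail):

    from collections import Counter

    def build_dictionary(corpus_path, min_count=1):
        """Collect every word/token occurring in the segmented training corpus.

        corpus_path : file with one pre-segmented sentence per line,
                      words separated by whitespace
        Returns a dict word -> count usable as the segmentation dictionary.
        """
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as handle:
            for line in handle:
                counts.update(line.split())
        return {word: freq for word, freq in counts.items() if freq >= min_count}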
    </Section>
    <Section position="2" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
3.2 Open Track
</SectionTitle>
      <Paragraph position="0"> The PK open track is similar to the closed one. In the open track, we use the entire six-month People's Daily corpus and set up a dictionary of 107,749 entries. Additionally, we find 101 new words from the Web and add them to the dictionary. We train the parameters for named entity recognition with a person list and a location list from our laboratory. The other parameters are trained in the same way as in the closed track.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="173" end_page="173" type="metho">
    <SectionTitle>
4 Experiments and Discussions
</SectionTitle>
    <Paragraph position="0"> We run several experiments on the PK test corpus to see the contribution of each postprocessor: we cut off one postprocessor at a time from the complete system and record the resulting F-score. The evaluation results are shown in Table 1; in the table, MDW represents Morphologically Derived Word Recognition. [Table 1. Evaluation results with one postprocessor cut off at a time; the table itself is not reproduced here.] From Table 1, we can draw some interesting conclusions: - The Merging of Adjoining Words has a good effect on both the open and closed tracks, so we can conclude that this module solves the problem of the inconsistent training corpus to some extent.</Paragraph>
    <Paragraph position="1"> - Morphologically Derived Word Recognition does some harm in the open track, but it has a very good effect in the closed track.</Paragraph>
    <Paragraph position="2"> This is probably because in the open track we can build a comparatively larger dictionary, since we can use any resource we have. Most MDWs6 are therefore already in the dictionary, and the MDWs that are not are mostly difficult to recognize, so the module does more harm than good in many cases. In the closed track, however, we have a small dictionary and many common MDWs are not in it, so the module does much more good there.</Paragraph>
    <Paragraph position="3"> - The effect of New Word Identification is minimal in both the open and closed tracks, probably because the preceding steps have already recognized most OOV words and it is hard to recognize any more new words.</Paragraph>
  </Section>
  <Section position="6" start_page="173" end_page="173" type="metho">
    <SectionTitle>
5 External Factors That Affect Our Performance
</SectionTitle>
    <Paragraph position="0"> The difference in the definition of words is the main factor that affects our performance. In many cases, words such as &amp;quot;=4 &amp;quot;, &amp;quot;U&amp;quot; and &amp;quot;4~ NV&amp;quot; are considered as one word in our system but not in the PK gold standard corpus. Another factor is the inconsistencies in the training corpus; although this problem has been solved to some extent by the merging module, the inconsistencies also exist in the test corpus, and in some instances a word is more likely to be a single word in the training corpus but more likely to be separated into two words in the test corpus. For example, the word &amp;quot;2 C&amp;quot; is more likely to be a single word in the training corpus but more likely to be separated into two words in the test corpus. A further factor affects MDW: many postfixes in our system are not considered as postfixes in the PK gold standard corpus. For example, the word &amp;quot;0N$&amp;quot; is recognized as an MDW in our system since &amp;quot;$&amp;quot; is a postfix; however, it is segmented into two separate words, &amp;quot;0N$&amp;quot;, in the PK gold standard corpus.</Paragraph>
    <Paragraph position="1"> 6 MDW refers to Morphologically Derived Words.</Paragraph>
  </Section>
</Paper>