<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3018">
  <Title>Combination of Machine Learning Methods for Optimum Chinese Word Segmentation Masayuki Asahara Chooi-Ling Goh Kenta Fukuoka</Title>
  <Section position="4" start_page="134" end_page="135" type="metho">
    <SectionTitle>
3 Model b
</SectionTitle>
    <Paragraph position="0"> Model b uses a different approach. First, we extract the OOV words using a MaxEnt classifier with only characters as features; we did not use character classes as features. Each character is assigned a BIES position tag (begin, intermediate, end, or single-character word). Word segmentation by character-based tagging was first introduced by Xue and Converse (2002). In encoding, for each character position in the training data, we extract the characters within a five-character window as the features for the classifier.</Paragraph>
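The tagging scheme and window features described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the feature-key format, the centred window (two characters on each side) and the padding symbol are all assumptions made for the example.

```python
def bies_tags(words):
    """Map a segmented sentence (list of words) to one BIES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, anything between I
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def window_features(chars, i, size=5, pad="#"):
    """Collect the characters in a window of `size` around position i.

    Assumption: the window is centred on i, and positions outside the
    sentence are filled with a hypothetical padding symbol.
    """
    half = size // 2
    feats = {}
    for offset in range(-half, half + 1):
        j = i + offset
        inside = j in range(len(chars))
        feats["c[%+d]" % offset] = chars[j] if inside else pad
    return feats


# One training instance per character: (features, tag)
words = ["研究", "生命", "起源"]
chars = [ch for w in words for ch in w]
instances = [(window_features(chars, i), tag)
             for i, tag in enumerate(bies_tags(words))]
```

Each `(features, tag)` pair would then be passed to whatever MaxEnt toolkit is in use; the classifier itself is outside the scope of this sketch.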
    <Paragraph position="1"> In decoding, BIES position tags are assigned deterministically, character by character, to the test data. Words that appear only in the test data are treated as OOV word candidates.</Paragraph>
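The candidate extraction step above could look roughly like this. It is a sketch under stated assumptions: the helper names are hypothetical, and the handling of ill-formed tag sequences (a word is simply closed at every E or S tag) is a simplification chosen for the example rather than something the text specifies.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character BIES tags.

    Assumption: a word ends at every E or S tag; any characters left
    over at the end of the sentence are emitted as a final word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def oov_candidates(test_words, iv_words):
    """Words produced on the test data that are absent from the training word list."""
    return {w for w in test_words if w not in iv_words}


# Hypothetical tagger output for a six-character test sentence
chars = list("研究生命起源")
tags = ["B", "E", "B", "E", "B", "E"]
iv_words = {"研究", "生命"}
candidates = oov_candidates(tags_to_words(chars, tags), iv_words)
```

Here `candidates` holds the one word that the training word list does not contain.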
    <Paragraph position="2"> This model achieves quite high unknown word recall, but its precision is somewhat low. However, the subsequent segmentation model tries to eliminate some of the false unknown words. In the next step, we append the OOV word candidates to the IV word list extracted from the training data. The segmentation model is similar to the OOV extraction method, except that the features include the output of the Maximum Matching (MaxMatch) algorithm. The algorithm runs in both the forward (FMaxMatch) and backward (BMaxMatch) directions, using the final word list as the reference. The outputs of FMaxMatch and BMaxMatch are also encoded as BIES tags. The differences between the FMaxMatch and BMaxMatch outputs indicate the positions where overlapping ambiguities occur. The final word segmentation is again carried out by a MaxEnt classifier.</Paragraph>
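A minimal sketch of forward and backward maximum matching, and of how their BIES-encoded outputs expose overlapping ambiguities, is given below. The function names, the maximum word length of four characters, and the fallback to single characters when nothing in the word list matches are assumptions made for illustration; the text does not specify them.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest match scanning left to right (max_len is an assumed limit)."""
    words, i = [], 0
    while i != len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:  # fall back to a single character
                words.append(cand)
                i += n
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy longest match scanning right to left."""
    words, j = [], len(text)
    while j != 0:
        for n in range(min(max_len, j), 0, -1):
            cand = text[j - n:j]
            if n == 1 or cand in lexicon:
                words.insert(0, cand)
                j -= n
                break
    return words


def to_bies(words):
    """Encode a word sequence as one BIES tag per character."""
    tags = []
    for w in words:
        tags.extend(["S"] if len(w) == 1 else ["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags


def disagreements(text, lexicon):
    """Character positions where the forward and backward tags differ."""
    fwd = to_bies(forward_max_match(text, lexicon))
    bwd = to_bies(backward_max_match(text, lexicon))
    return [i for i, (f, b) in enumerate(zip(fwd, bwd)) if f != b]


lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", lexicon))  # ['研究', '生命', '起源']
print(disagreements("研究生命起源", lexicon))       # [1, 2, 3]
```

In a segmenter of the kind described, the two tag sequences would be supplied as additional features at each character position, so the classifier can learn which direction to trust where they disagree.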
    <Paragraph position="3"> Note that both procedures in Model b use the whole training data in the training phase. During the training phase, the dictionary used in the MaxMatch algorithm is extracted from the training data only, so the training of the segmentation model does not explicitly consider OOV words. Unlike Models a and c, Model b does not use word and character classes as features. The details of the model can be found in (Goh et al., 2004b). The difference is that we do not provide character types here because they are forbidden in this round. In addition, we did not prune the OOV words because this step involves the intervention of human knowledge.</Paragraph>
  </Section>
</Paper>