File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/p02-1064_metho.xml
Size: 7,886 bytes
Last Modified: 2025-10-06 14:07:59
<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1064">
<Title>An Empirical Study of Active Learning with Support Vector Machines for Japanese Word Segmentation</Title>
<Section position="4" start_page="0" end_page="3" type="metho">
<SectionTitle> ... </SectionTitle>
<Paragraph position="0"> ... are weights. Besides the ...
Figure 1: An algorithm of pool-based active learning.
1. Build an initial classifier.
2. While a teacher can label examples:
(a) Apply the current classifier to each unlabeled example.
(b) Find the m examples which are most informative for the classifier.
(c) Have the teacher label the subsample of m examples.
(d) Train a new classifier on all labeled examples.</Paragraph>
<Section position="1" start_page="0" end_page="1" type="sub_section">
<SectionTitle> 3.1 General Framework of Active Learning </SectionTitle>
<Paragraph position="0"> We use pool-based active learning (Lewis and Gale, 1994). SVMs are used here instead of the probabilistic classifiers used by Lewis and Gale. Figure 1 shows an algorithm of pool-based active learning. The algorithm can take various forms depending on what kind of example is considered informative.</Paragraph>
</Section>
<Section position="2" start_page="1" end_page="2" type="sub_section">
<SectionTitle> 3.2 Previous Algorithm </SectionTitle>
<Paragraph position="0"> Two groups have proposed an active learning algorithm for SVMs (Tong and Koller, 2000; Schohn and Cohn, 2000). Figure 2 shows the selection algorithm they propose; it corresponds to steps (a) and (b) in Figure 1.</Paragraph>
<Paragraph position="1"> The figure given here follows the form of Lewis and Gale's (1994) sequential sampling algorithm. Tong and Koller (2000) propose three selection algorithms; the method described here is the simplest and computationally efficient one.</Paragraph>
</Section>
<Section position="3" start_page="2" end_page="3" type="sub_section">
<SectionTitle> 3.3 Two Pool Algorithm </SectionTitle>
<Paragraph position="0"> We observed in our experiments that, when the algorithm of the previous section is used, a classifier with a larger pool requires more labeled examples in the early stage of training than one with a smaller pool does (to be described in Section 5). To overcome this weakness, we propose two new algorithms, which we generically call the &quot;Two Pool Algorithm&quot;. It keeps two pools, a primary pool and a secondary one, and gradually moves unlabeled examples from the secondary pool to the primary one instead of using a large pool from the start of training. The primary pool is used directly to select the examples which a teacher is asked to label, whereas the secondary pool is not. The basic idea is simple: since we cannot get good performance when using a large pool at the beginning of training, we gradually enlarge the pool of unlabeled examples.</Paragraph>
<Paragraph position="1"> The outline of the Two Pool Algorithm is shown in Figure 3.
Figure 3: Outline of the Two Pool Algorithm.
1. Build an initial classifier.
2. While a teacher can label examples:
(a) Select m examples using the algorithm in Figure 2.
(b) Have the teacher label the subsample of m examples.
(c) Train a new classifier on all labeled examples.
(d) Add new unlabeled examples to the primary pool if a specified condition is true.
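As a minimal illustration of the loop in Figures 1 and 3 (not the authors' implementation): the sketch below assumes scikit-learn's SVC as the learner, a hypothetical ask_teacher labeling oracle, a seed set containing both classes, and leaves the pool-growing test of step (d) as a pluggable should_grow hook, since the concrete conditions are the two variations described next.

import numpy as np
from sklearn.svm import SVC

def select_most_informative(clf, pool_X, batch_size):
    # Figure 2-style selection: take the unlabeled examples closest to the
    # separating hyperplane, i.e. with the smallest |decision value|.
    margins = np.abs(clf.decision_function(pool_X))
    return np.argsort(margins)[:batch_size]

def two_pool_active_learning(seed_X, seed_y, primary, secondary, ask_teacher,
                             batch_size=20,
                             should_grow=lambda clf, n_labeled: False):
    # Select from the primary pool only, retrain after each labeled batch,
    # and grow the primary pool from the secondary pool whenever
    # `should_grow` fires (step (d) in Figure 3).
    labeled_X, labeled_y = list(seed_X), list(seed_y)
    primary, secondary = list(primary), list(secondary)
    clf = SVC(kernel="linear").fit(labeled_X, labeled_y)
    while primary:                                  # while a teacher can label examples
        idx = select_most_informative(clf, np.array(primary), batch_size)
        batch = [primary[i] for i in idx]
        for i in sorted(idx, reverse=True):         # remove the selected examples
            primary.pop(i)
        labeled_X.extend(batch)                     # have the teacher label them
        labeled_y.extend(ask_teacher(x) for x in batch)
        clf = SVC(kernel="linear").fit(labeled_X, labeled_y)
        if should_grow(clf, len(labeled_X)) and secondary:
            # Enlarge gradually: here the total of labeled examples plus the
            # primary pool is doubled, as in Section 3.3.
            n_add = len(labeled_X) + len(primary)
            primary.extend(secondary[:n_add])
            del secondary[:n_add]
    return clf

Plugging in a should_grow that tests either of the conditions below turns this outline into Algorithm A or Algorithm B.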
We describe below two variations, which differ in the condition at step (d) in Figure 3.</Paragraph>
<Paragraph position="2"> Our first variation, called Two Pool Algorithm A, adds new unlabeled examples to the primary pool when the increasing ratio of support vectors in the current classifier decreases, because the gain in accuracy is very small once this ratio drops.</Paragraph>
<Paragraph position="3"> We observe this phenomenon in our experiments (Section 5); it has also been reported in a previous study (Schohn and Cohn, 2000).</Paragraph>
<Paragraph position="4"> In the Two Pool Algorithm we add new unlabeled examples so that the total number of examples, counting both the labeled examples in the training set and the unlabeled examples in the primary pool, is doubled. For example, suppose that the initial primary pool holds 1,000 examples. Before training starts, there are no labeled examples and 1,000 unlabeled examples. We add 1,000 new unlabeled examples to the primary pool when the increasing ratio of support vectors drops, say after t examples have been labeled; there are then t labeled examples and (2,000 − t) unlabeled examples in the primary pool. The next time we add new unlabeled examples, 2,000 are added, so that the total number of labeled examples in the training set and unlabeled examples in the primary pool becomes 4,000.</Paragraph>
<Paragraph position="5"> Our second variation, called Two Pool Algorithm B, adds new unlabeled examples to the primary pool when the number of support vectors of the current classifier exceeds a threshold d, defined as:</Paragraph>
<Paragraph position="6"> d = δ × N </Paragraph>
<Paragraph position="7"> where δ is a parameter for deciding when unlabeled examples are added to the primary pool and N is the number of examples, counting both the labeled examples in the training set and the unlabeled ones in the primary pool. The δ must be less than the percentage of support vectors of a training set. (Since the percentage of support vectors is typically small, e.g., less than 30%, we choose a value of around 10% for δ; further study is needed to find the best value of δ before or during training.) When deciding how many unlabeled examples to add to the primary pool, we use the doubling strategy described above.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="3" end_page="3" type="metho">
<SectionTitle> 4 Japanese Word Segmentation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 4.1 Word Segmentation as a Classification Task </SectionTitle>
<Paragraph position="0"> Many tasks in natural language processing can be formulated as a classification task (van den Bosch et al., 1996). Japanese word segmentation can be viewed in the same way (Shinnou, 2000). Let a Japanese character sequence be s = c_1 c_2 ... c_m.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="3" end_page="4" type="metho">
<SectionTitle> 4.2 Features </SectionTitle>
<Paragraph position="0"> We assume that each character c_i has two attributes.</Paragraph>
<Paragraph position="1"> The first attribute is a character type (t_i): hiragana, katakana, kanji (Chinese characters), numbers, English alphabets, kanji-numbers (numbers written in Chinese), or symbols.
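For illustration only, a character's type can be read off its code point. The sketch below is an approximation under assumptions not made in the paper: the paper works with the JIS X 0208 character set, whereas this version uses Unicode ranges and a hand-picked list of kanji numerals.

KANJI_NUMERALS = set("〇一二三四五六七八九十百千万億兆")

def char_type(c: str) -> str:
    # Map a single character to one of the seven types listed in Section 4.2.
    code = ord(c)
    if c in KANJI_NUMERALS:
        return "kanji-number"                 # numbers written in Chinese characters
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if c.isdigit():
        return "number"
    if c.isascii() and c.isalpha():
        return "alphabet"                     # English alphabet
    return "symbol"

# e.g. char_type("は") -> "hiragana", char_type("三") -> "kanji-number"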
A character type gives some hints for segmenting a Japanese sentence into words. For example, kanji is mainly used to represent nouns or the stems of verbs and adjectives. It is never used for particles, which are always written in hiragana. Therefore, it is more probable that a boundary exists between a kanji character and a hiragana character. Of course, there are quite a few exceptions to this heuristic. For example, some proper nouns are written in a mixture of hiragana, kanji and katakana.</Paragraph>
<Paragraph position="2"> The second attribute is a character code (k_i). The range of a character code is from 1 to 6,879. JIS X 0208, which is one of the Japanese character set standards, enumerates 6,879 characters.</Paragraph>
<Paragraph position="3"> We use four characters here to decide a word boundary. A set of the attributes of c ...</Paragraph>
</Section>
</Paper>
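One way the per-boundary features of Section 4 might be assembled from the two attributes, as a sketch under assumptions: the four-character window (two characters on each side of the candidate boundary between c_i and c_{i+1}) and the string-valued sparse features are illustrative choices, since the exact encoding is not recoverable from the extracted text. char_type is the helper from the previous sketch.

def boundary_features(sentence: str, i: int) -> list:
    # Features for the candidate boundary between sentence[i] and sentence[i+1].
    feats = []
    for offset in (-1, 0, 1, 2):              # c_{i-1}, c_i, c_{i+1}, c_{i+2}
        pos = i + offset
        if 0 <= pos < len(sentence):
            c = sentence[pos]
            feats.append("t[%d]=%s" % (offset, char_type(c)))   # character type attribute
            feats.append("k[%d]=%d" % (offset, ord(c)))          # character code attribute
        else:
            feats.append("t[%d]=<pad>" % offset)                 # sentence edge padding
    return feats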