<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-6003">
  <Title>A Study of Applying BTM Model on the Chinese Chunk Bracketing</Title>
  <Section position="3" start_page="25" end_page="27" type="intro">
    <SectionTitle>
3 Experiment Results
</SectionTitle>
    <Paragraph position="0"> To conduct the following experiments in tenfolds, we randomly select 50,000 trees of CCTB2.1 and separate them into the following  two sets: (1) Training Set consists of 45,000 CCTB2.1 trees; and (2) Open Testing Set consists of the other 5,000 CCTB2.1 trees.</Paragraph>
    <Paragraph position="1">  In our computation, 66% of CCTB2.1 BL POS patterns in the open testing set are not found in the training set. This means the ratio of unseen CCTB2.1 BL POS patterns in the open testing set is 66%. The PCTB4 BTM dataset was not used in this study by two reasons: the first one is that the PCTB is not a balanced CTB; the second one is that the POS tagging system of PCTB is not a hierarchical system.</Paragraph>
    <Paragraph position="2"> We conducted four experiments in this study. The first three experiments are designed to show the relationships between the chunk bracketing performance of the BTM model on the matching sentences and the three BTM parameters: POS layer number; BTM threshold value; and BTM training size. To avoid the error propagation of word segmentation and POS tagging, the first three experiments only consider open testing sentences with correct word segmentations and POS tags provided in CCTB2.1 as perfect input. The fourth experiment is to show the BTM model is able to improve the performance (Fmeasure) of N-gram models on Chinese chunk bracketing for both perfect input and actual input. Here, the actual input means the word segmentations and POS tags of the testing sentences were all generated by a forward maximum matching (FMM) segmenter and a bigram-based POS tagger, respectively.</Paragraph>
    <Paragraph position="3"> To evaluate the performance of our BTM model, we use recall (R), precision (P), and F-measure (F) (Manning and Schuetze, 1999), which are defined as follows:</Paragraph>
    <Paragraph position="5"> In addition, we use coverage ratio (CR) to represent the size of matching sentences (or say, matching set) of our BTM model. The CR is defined as: Coverage Ratio (CR) = (# of not NULL output sentences) / (# of total testing sentences) (4)</Paragraph>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
3.1 Relationship between POS layer number and BTM performance
</SectionTitle>
      <Paragraph position="0"> and BTM performance In the 1st experiment, the BTM threshold value is set to 1 and the BTM training size is set to 45,000. Table 7 is the first experimental results of BTM performance (P, R, F) and CR for the POS layer numbers are 1, 2, 3, 4 or 5. From Table 7, it shows the POS layer number is positively related to the F-measure. Since the BTM model with POS layer number 2 is able to achieve more than 96% F-measure, we use POS layer number 2 to conduct the following experiments. This experimental result seems to indicate that the CCTB2.1 dataset with POS layer number 2 (including 57 distinct POS tags) can provide sufficient information for the BTM model to achieve an F-measure of more than 96% and a maximum CR of 46.88%.</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
3.2 Relationship between BTM threshold value and BTM performance
</SectionTitle>
      <Paragraph position="0"> value and BTM performance In the 2nd experiment, the POS layer number is set to 2 and the BTM training size is set to  45,000. Table 8 is the second experimental results of BTM performance and CR when the BTM threshold value is 1.0, 0.9, 0.8, 0.7, 0.6 or 0.5. From Table 8, it shows the BTM threshold value is positively related to the F-measure. Besides, the F-measure difference between threshold values 1.0 and 0.5 is only 1.37%. This result indicates that the BTM model can robustly maintain an F-measure of more than 95% and a CR of more than 46% while the POS layer number is set to 2, BTM training size is set to 45,000 and the BTM threshold value is t 0.5.</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.3 Relationship between BTM training size and BTM performance
</SectionTitle>
      <Paragraph position="0"> and BTM performance In the 3rd experiment, the BTM threshold value is set to 0.5 and POS layer number is set to 2. Table 9 is the third experimental results of BTM performance and CR when the BTM training size is 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000 or 45000. From Table 9, it seems to indicate that the F-measure of the BTM model is independent of the training size because the maximum difference between these respective F-measures is only 0.88%.</Paragraph>
      <Paragraph position="1">  To sum up the above three experimental results (Tables 7-9), it shows that the F-measure (overall performance) of our BTM model with POS layer number (t 2) is apparently not sensitive to BTM threshold value (t 0.5) and BTM training size (t 5,000) on the matching set with perfect input. Since the CR of our BTM model is positively related to BTM training size, it indicates our BTM model should be able to maintain the high performance chunk bracketing (more than 95% F-measure on the matching set with perfect input) and increase the CR only by enlarging the BTM training size.</Paragraph>
    </Section>
    <Section position="4" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.4 Comparative study of the N-gram model and the BTM model on perfect/actual input
</SectionTitle>
      <Paragraph position="0"> and the BTM model on perfect/actual input To conduct the 4th experiment, we develop N-gram models (NGM) by the SRILM (Stanford Research Institute Language Modeling) toolkit (Stolcke, 2002) as the baseline model. SRILM is a freely available collection of C++ libraries, executable programs, and helper scripts designed to allow both production of, and experimentation with, statistical language models for speech recognition and other NLP applications (Stolcke, 2002). In this experiment, the TL POS patterns (such as &amp;quot;&lt;Na:DE:Na+VH:VH&gt;&amp;quot;) of training set were used as the data for SIRLM to build N-gram models. Then, use these N-gram models to determine the chunks for each BL POS pattern in the testing set. Note that these N-gram models were trained by the TL POS patterns only, not by each layer's POS patterns.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the distribution of n-gram patterns of N-gram models (N is from 2 to 44) trained by the training set.</Paragraph>
      <Paragraph position="2">  Tables 10, 11, 12 13 and 14 are the results of the fourth experiment. The explanations of the five tables are given below.</Paragraph>
      <Paragraph position="3">  From Table 10, it shows the maximum precision, recall and F-measure of N-gram models all occur at the 4-gram model for perfect input.</Paragraph>
      <Paragraph position="4"> Thus, we use the 4-gram model as the baseline model in this experiment. Tables 11 and 12 are the comparative experimental results of the baseline model and the BTM model on the matching sets of perfect input and actual input, respectively. From Table 11, it shows the performance (95.1% F-measure) of a BTM (0.5, 2, 45,000) is 5.6% greater than that of a 4-gram model (89.5% F-measure) for the matching set with perfect input. From Table 12, it shows the performance (97.3% F-measure) of a BTM (0.5, 2, 45,000) is 1.4% greater than that of a 4-gram model (95.9% F-measure) for the matching set with actual input. Table 13 is the experimental results of applying the BTM model to the matching set and the 4-gram model to the non-matching set. From Table 13, it shows the F-measure of a 4-gram model can be improved by the BTM model for both perfect input (2.5% increasing) and actual input (1% increasing).</Paragraph>
      <Paragraph position="5"> According to all the four experimental results, we have: (1) the BTM model can achieve better F-measure performance than N-gram models on the matching sets for both perfect input and actual input; and (2) the chunk bracketing performance of the BTM model for the matching sets should be high and stable against training size, perfect and actual input while POS layer number t 2 and BTM threshold value t 0.5.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>