<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0732">
  <Title>Improving Chunking by Means of Lexical-Contextual Information in Statistical Language Models</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overview of the system
</SectionTitle>
    <Paragraph position="0"> The baseline system described in (Pla et al., 2000a) uses bigrams, formalized as finite-state automata. It is a transducer composed of two levels (see Figure 1). The upper one (Figure 1a) represents the contextual LM for the sentences.</Paragraph>
    <Paragraph position="1"> The symbols associated with the states are POS tags (Ci) and chunk descriptors (Si). The lower level models the different chunks considered (Figure 1b); in this case, the symbols are the POS tags (Ci) that belong to the corresponding chunk (Si). Next, a regular substitution of the lower models into the upper level is made (Figure 1c). In this way, we obtain a single Integrated LM which represents the possible concatenations of lexical tags and chunks. Each state is relabelled with a tuple (Ci, Sj), where Ci ∈ C and Sj ∈ S; C is the POS tag set used and S = {[Si, Si], Si, S0} is the chunk set defined. [Si and Si] stand for the initial and final states of the chunk whose descriptor is Si. The label Si is assigned to states inside the chunk Si, and S0 is assigned to states outside of any chunk. All the LMs involved have been smoothed using a back-off technique (Katz, 1987). We have not specified lexical probabilities in every state of the different contextual models; instead, we assumed that P(Wj | (Ci, Si)) = P(Wj | Ci) for every Si ∈ S.</Paragraph>
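The relabelling and bigram estimation over the tuple states can be illustrated with a minimal sketch (Python, not part of the original paper). The function names relabel_sentence and build_bigram_lm, the span-based chunk representation, and the use of plain maximum-likelihood estimates instead of Katz back-off are all simplifying assumptions:

```python
from collections import defaultdict

def relabel_sentence(pos_tags, chunk_spans):
    """Relabel a POS-tagged sentence with (Ci, Sj) tuple states.

    pos_tags:    list of POS tags, one per word.
    chunk_spans: list of (start, end, descriptor) with end exclusive.
    Chunk labels follow the paper's notation: "[Si" (chunk-initial),
    "Si]" (chunk-final), "Si" (chunk-internal), "S0" (outside any chunk)."""
    states = [(c, "S0") for c in pos_tags]
    for start, end, s in chunk_spans:
        for i in range(start, end):
            states[i] = (pos_tags[i], s)                    # chunk-internal
        states[start] = (pos_tags[start], "[" + s)          # chunk-initial
        states[end - 1] = (pos_tags[end - 1], s + "]")      # chunk-final
        if end - start == 1:                                # one-word chunk
            states[start] = (pos_tags[start], "[" + s + "]")
    return states

def build_bigram_lm(relabelled_corpus):
    """Maximum-likelihood bigram transition probabilities over tuple states
    (the paper additionally applies Katz back-off smoothing)."""
    counts = defaultdict(lambda: defaultdict(int))
    for states in relabelled_corpus:
        seq = [("<s>", "S0")] + states + [("</s>", "S0")]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values())
        probs[prev] = {cur: n / total for cur, n in nxt.items()}
    return probs
```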
    <Paragraph position="2"> Once the integrated transducer has been built, the tagging and shallow parsing process consists of finding the sequence of states of maximum probability through it for an input sentence.</Paragraph>
    <Paragraph position="3"> Therefore, this sequence must be compatible with the contextual, syntactic and lexical constraints. This process can be carried out by dynamic programming using the Viterbi algorithm (Viterbi, 1967), which has been appropriately modified to use our models. From the dynamic programming trellis, we can obtain the maximum-probability path for the input sentence through the model, and thus the best sequence of lexical tags and the best segmentation into chunks, in a single process.</Paragraph>
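The decoding step can be sketched as follows (Python, illustrative only; viterbi_decode and the table layouts trans, lex and tag_dict are assumptions, and probabilities are combined in log space to avoid underflow):

```python
import math

def viterbi_decode(words, states, trans, lex, tag_dict):
    """Find the maximum-probability state sequence for a sentence.

    states:   list of (Ci, Sj) tuple states of the integrated LM
    trans:    trans[prev][cur] = P(cur | prev)    (contextual model)
    lex:      lex[word][ci]    = P(word | Ci)     (lexical model)
    tag_dict: possible POS tags per word (the simulated morphological analyzer)"""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    start = ("<s>", "S0")
    delta = {start: 0.0}          # best log-probability reaching each state
    back = [{} for _ in words]    # back-pointers, one dict per position

    for i, w in enumerate(words):
        new_delta = {}
        for s in states:
            ci, _ = s
            if ci not in tag_dict.get(w, {ci}):   # lexical constraint; unknown words allow any tag
                continue
            emit = logp(lex.get(w, {}).get(ci, 0.0))
            best_prev, best = None, float("-inf")
            for prev, score in delta.items():
                cand = score + logp(trans.get(prev, {}).get(s, 0.0)) + emit
                if cand > best:
                    best_prev, best = prev, cand
            if best_prev is not None:
                new_delta[s] = best
                back[i][s] = best_prev
        delta = new_delta

    # Recover the best path: it yields the POS tags and the chunk segmentation at once.
    last = max(delta, key=delta.get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```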
  </Section>
  <Section position="3" start_page="0" end_page="148" type="metho">
    <SectionTitle>
3 Specialized Contextual Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="148" type="sub_section">
      <SectionTitle>
Models
</SectionTitle>
      <Paragraph position="0"> The contextual model for the sentences and the models for chunks (and, therefore, the ILM) can be modified to take into account certain words in the context in which they appear. This specialization allows us to set certain contextual constraints which modify the contextual LMs and improve the performance of the chunker (as shown below). This set of words can be defined using heuristics such as: the most frequent words in the training corpus, the words with the highest tagging error rate, the words that belong to closed classes (prepositions, pronouns, etc.), or any other words chosen following some linguistic criterion. To do this, we added to the POS tag set the set of structural tags (Wi, Cj), one for each specialized word Wi in each of its possible categories Cj. Then, we relabelled the training corpus: if a word Wi was labelled with the POS tag Cj, we replaced Cj with the pair (Wi, Cj). The learning process of the bigram LMs was carried out from this new training data set.</Paragraph>
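A minimal sketch of this relabelling step (Python, illustrative; the token format, the function name specialize_corpus and the lowercasing of specialized words are assumptions):

```python
def specialize_corpus(tagged_corpus, specialized_words):
    """Relabel a POS-tagged corpus for the specialized LM.

    tagged_corpus: list of sentences, each a list of (word, pos_tag) pairs.
    Whenever a specialized word Wi appears with tag Cj, the tag is replaced
    by the structural tag (Wi, Cj); other tokens keep their plain POS tag."""
    relabelled = []
    for sentence in tagged_corpus:
        new_sentence = []
        for word, tag in sentence:
            if word.lower() in specialized_words:
                new_sentence.append((word, (word.lower(), tag)))  # structural tag (Wi, Cj)
            else:
                new_sentence.append((word, tag))
        relabelled.append(new_sentence)
    return relabelled
```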
      <Paragraph position="1"> The contextual LMs obtained have some specific states which are related to the specialized words. In the basic Integrated Language Model (ILM), a state was labelled by (Ci, Sj). In the specialized ILM, a state can be specialized for a certain word Wk (only if the word Wk belongs to the category Ci). In this case, the state is relabelled with the tuple (Wk, Ci, Sj) and only the word Wk can be emitted from it, with probability 1.</Paragraph>
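The corresponding handling of lexical probabilities can be sketched as follows (Python, illustrative; lexical_probability and the state encodings are assumptions that extend the earlier sketches):

```python
def lexical_probability(word, state, lex):
    """Lexical probability for a state of the specialized ILM.

    state is either (Ci, Sj) or, for specialized words, (Wk, Ci, Sj).
    From a specialized state only Wk can be emitted, with probability 1;
    for ordinary states the assumption P(Wj | (Ci, Sj)) = P(Wj | Ci) is kept."""
    if len(state) == 3:                      # specialized state (Wk, Ci, Sj)
        wk, _, _ = state
        return 1.0 if word.lower() == wk else 0.0
    ci, _ = state                            # ordinary state (Ci, Sj)
    return lex.get(word, {}).get(ci, 0.0)
```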
    </Section>
  </Section>
  <Section position="4" start_page="148" end_page="149" type="metho">
    <SectionTitle>
4 Experimental Work
</SectionTitle>
    <Paragraph position="0"> We applied both approaches (ILM and specialized ILM) using the training and test data of the CoNLL-2000 shared task (http://lcgwww.uia.ac.be/conll2000). We also evaluated how the performance of the chunker varies when we modify the specialized word set. Nevertheless, our approach can be applied directly to other corpora (including other languages), other lexical tag sets, or other kinds of chunks.</Paragraph>
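For reference, the CoNLL-2000 data files contain one token per line with three whitespace-separated columns (word, POS tag, chunk tag in IOB notation) and blank lines between sentences; a minimal reader might look like the following sketch (Python, not part of the original paper):

```python
def read_conll2000(path):
    """Read a CoNLL-2000 file into a list of sentences.

    Each sentence is a list of (word, pos_tag, chunk_tag) triples,
    where chunk_tag uses the IOB notation (e.g. B-NP, I-NP, O)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                    # blank line: sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk = line.split()
            current.append((word, pos, chunk))
    if current:
        sentences.append(current)
    return sentences
```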
    <Paragraph position="1"> Although our system is able to carry out tagging and chunking in a single process, we will not present tagging results for this task, as the POS tags of the data set used are not supervised and, therefore, a comparison is not possible.</Paragraph>
    <Paragraph position="2"> We would like to point out that we have simulated a morphological analyzer for English: we constructed a tag dictionary from the lexicon of the training set and the test set used. This dictionary gave us the possible lexical tags for each word in the corpus. In no case was the test set used to estimate the lexical probabilities. As stated above, several criteria can be used to define the set of specialized words. We selected the most frequent words in the training data set, leaving out certain words such as punctuation symbols, proper nouns, numbers, etc. This did not decrease the performance of the chunker and also reduced the number of states of the contextual LMs. Figure 2 shows how the performance of the chunker (Fβ=1) improves as a function of the size of the specialized word set. The best results were obtained with the set of words whose frequency in the training corpus was larger than 80 (about 470 words). We obtained similar results when considering only the words of the training set belonging to closed classes (that, about, as, if, out, while, whether, for, to, ...). In Table 1 we present the results of chunking with the specialized ILM. When comparing these results with those obtained using the basic ILM, we observed that, in general, the F score improved for each chunk type. The best improvements were observed for SBAR (from 0.37 to 79.46), PP (from 88.94 to 95.51) and PRT (from 38.82 to 66.67).</Paragraph>
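A sketch of the frequency-based selection of the specialized word set (Python, illustrative; the threshold of 80 is taken from the text, while the Penn Treebank exclusion tags are an assumption about what "punctuation symbols, proper nouns, numbers, etc." means in practice):

```python
from collections import Counter

def select_specialized_words(tagged_corpus, min_freq=80,
                             excluded_tags=("NNP", "NNPS", "CD",
                                            ".", ",", ":", "``", "''")):
    """Select specialized words: training-set words more frequent than min_freq,
    excluding tokens whose tags mark proper nouns, numbers, or punctuation."""
    freq = Counter()
    for sentence in tagged_corpus:
        for word, tag in sentence:
            if tag not in excluded_tags:
                freq[word.lower()] += 1
    return {w for w, n in freq.items() if n > min_freq}
```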
  </Section>
  <Section position="5" start_page="149" end_page="149" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this paper, we have presented a system for Tagging and Chunking based on an Integrated Language Model that uses a homogeneous formalism (finite-state machines) to combine different knowledge sources. It is feasible both in terms of performance and in terms of computational efficiency.</Paragraph>
    <Paragraph position="1"> All the models involved are learnt automatically from data, so the system is very flexible with respect to changes in the reference language, in the POS tag set, or in the definition of chunks.</Paragraph>
    <Paragraph position="2"> Our approach allows us to use any regular model which has been previously defined or learnt. In previous work, we used bigrams (Pla et al., 2000a) and combined them with more complex models learnt using grammatical inference techniques (Pla et al., 2000b). In this work, we used only bigram models improved with lexical-contextual information.</Paragraph>
    <Paragraph position="3"> The Fβ=1 score obtained increased from 86.64 to 90.14 when we used the specialized ILM. Nevertheless, we believe that the models could be improved with a more detailed study of the words whose contextual information is really relevant to tagging and chunking.</Paragraph>
  </Section>
</Paper>