<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0611">
  <Title>Improving sequence segmentation learning by predicting trigrams</Title>
  <Section position="4" start_page="81" end_page="82" type="intro">
    <SectionTitle>
2 Data and methodology
</SectionTitle>
    <Paragraph position="0"> The three data sets we used for this study represent a varied set of sentence-level chunking tasks of both syntactic and semantic nature: English base phrase chunking (henceforth CHUNK), English named-entity recognition (NER), and disfluency chunking in transcribed spoken Dutch utterances (DISFL).</Paragraph>
    <Paragraph position="1"> CHUNK is the task of splitting sentences into non-overlapping syntactic phrases or constituents. The used data set, extracted from the WSJ Penn Treebank, contains 211,727 training examples and 47,377 test instances. The examples represent seven-word windows of words and their respective (predicted) part-of-speech tags, and each example is labeled with a class using the IOB type of segmentation coding as introduced by Ramshaw and Marcus (1995), marking whether the middle word is inside (I), outside (O), or at the beginning (B) of a chunk. Words occuring less than ten times in the training material are attenuated (converted into a more general string that retains some of the word's surface form). Generalization performance is measured by the F-score on correctly identified and labeled constituents in test data, using the evaluation method originally used in the &amp;quot;shared task&amp;quot; sub-event of the CoNLL-2000 conference (Tjong Kim Sang and Buchholz, 2000) in which this particular training and test set were used. An example sentence with base phrases marked and labeled is the following: [He]NP [reckons]V P [the current account deficit]NP [will narrow]V P [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .</Paragraph>
    <Paragraph position="2"> NER, named-entity recognition, is to recognize and type named entities in text. We employ the English NER shared task data set used in the CoNLL2003 conference, again using the same evaluation method as originally used in the shared task (Tjong Kim Sang and De Meulder, 2003). This data set discriminates four name types: persons, organizations, locations, and a rest category of &amp;quot;miscellany names&amp;quot;. The data set is a collection of newswire articles from the Reuters Corpus, RCV11. The given training set contains 203,621 examples; as test set we use the &amp;quot;testb&amp;quot; evaluation set which contains 46,435 examples. Examples represent seven-word windows of unattenuated words with their respective predicted part-of-speech tags. No other task-specific features such as capitalization identifiers or seed list features were used. Class labels use the IOB segmentation coding coupled with the four possible name type labels. Analogous to the CHUNK task, generalization performance is measured by the F-score on correctly identified and labeled named entities in test data. An example sentence with the named entities segmented and typed is the following: [U.N.]organization official [Ekeus]person heads for [Baghdad]location.</Paragraph>
    <Paragraph position="3"> DISFL, disfluency chunking, is the task of recognizing subsequences of words in spoken utterances such as fragmented words, laughter, selfcorrections, stammering, repetitions, abandoned constituents, hesitations, and filled pauses, that are not part of the syntactic core of the spoken utterance. We use data introduced by Lendvai et al.</Paragraph>
    <Paragraph position="4"> (2003), who extracted the data from a part of the Spoken Dutch Corpus of spontaneous speech2 that is both transcribed and syntactically annotated. All words and multi-word subsequences judged not to be part of the syntactic tree are defined as disfluent chunks. We used a single 90% - 10% split of the data, producing a training set of 303,385 examples and a test set of 37,160 examples. Each example represents a window of nine words (attenuated below an occurrence threshold of 100) and 22 binary features representing various string overlaps (to encode possible repetitions); for details, cf. (Lendvai  et al., 2003). Generalization performance is measured by the F-score on correctly identified disfluent chunks in test data. An example of a chunked Spoken Dutch Corpus sentence is the following (&amp;quot;uh&amp;quot; is a filled pause; without the disfluencies, the sentence means &amp;quot;I have followed this process with a certain amount of scepticism for about a year&amp;quot;): [ik uh] ik heb met de nodige scepsis [uh] deze gang van zaken [zo'n] zo'n jaar aangekeken.</Paragraph>
    <Paragraph position="5"> We perform our experiments on the three tasks using three machine-learning algorithms: the memory-based learning or k-nearest neighbor algorithm as implemented in the TiMBL software package (version 5.1) (Daelemans et al., 2004), henceforth referred to as MBL; maximum-entropy classification (Guiasu and Shenitzer, 1985) as implemented in the maxent software package (version 20040930) by Zhang Le3, henceforth MAXENT; and a sparse-winnow network (Littlestone, 1988) as implemented in the SNoW software package (version 3.0.5) by Carlson et al. (1999), henceforth WINNOW. All three algorithms have algorithmic parameters that bias their performance; to allow for a fair comparison we optimized each algorithm on each task using wrapped progressive sampling (Van den Bosch, 2004) (WPS), a heuristic automatic procedure that, on the basis of validation experiments internal to the training material, searches among algorithmic parameter combinations for a combination likely to yield optimal generalization performance on unseen data. We used wrapped progressive sampling in all experiments.</Paragraph>
  </Section>
class="xml-element"></Paper>