<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0113"> <Title>Development of a Partially Bracketed Corpus with Part-of-Speech Information Only</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Research based on treebanks, i.e., corpora annotated with syntactic structures, is active in many natural language applications [1-5]. Framis [1] proposes a methodology to extract selectional restrictions at a variable level of abstraction from the Penn Treebank. Chen and Chen [2] propose a probabilistic chunker to decide the implicit boundaries of constituents and utilize linguistic knowledge to extract noun phrases by a finite state mechanism. In their study, the Susanne Corpus is used as the training corpus for the chunker. Pocock and Atwell [3] investigate statistical grammars extracted from the Spoken English Corpus (SEC), and apply these grammars to find the grammatically optimal path through a word lattice. Stochastic parsers are also developed in [4,5]. All these applications employ syntactic information extracted from different treebanks and show satisfactory results.</Paragraph> <Paragraph position="1"> However, building a large-scale treebank is laborious and tedious. Very few large-scale treebanks are currently available, especially for languages other than English. In this paper, we propose a probabilistic chunker to help the development of a partially bracketed corpus, i.e., a simpler version of a treebank. The chunker partitions a part-of-speech sequence into segments called chunks. Rather than using a treebank as the training corpus, we use a corpus tagged with part-of-speech information only. In the following sections, we first introduce the experimental framework of our model. The Lancaster-Oslo/Bergen (LOB) Corpus and the Susanne Corpus are adopted. Then a tag mapper and a probabilistic chunker are described.
Before concluding, the experimental results are presented.</Paragraph> </Section> </Paper>