<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1055">
  <Title>Shallow parsing on the basis of words only: A case study</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> It is common in parsing to assign part-of-speech (POS) tags to words as a first analysis step providing information for further steps. In many early parsers, the POS sequences formed the only input to the parser, i.e. the actual words were not used except in POS tagging. Later, with feature-based grammars, information on POS had a more central place in the lexical entry of a word than the identity of the word itself, e.g. MAJOR and other HEAD features in (Pollard and Sag, 1987). In the early days of statistical parsers, POS were explicitly and often exclusively used as symbols to base probabilities on; these probabilities are generally more reliable than lexical probabilities, due to the inherent sparseness of words.</Paragraph>
    <Paragraph position="1"> In modern lexicalized parsers, POS tagging is often interleaved with parsing proper instead of being a separate preprocessing module (Collins, 1996; Ratnaparkhi, 1997). Charniak (2000) notes that having his generative parser generate the POS of a constituent's head before the head itself increases performance by 2 points. He suggests that this is due to the usefulness of POS for estimating back-off probabilities. null Abney's (1991) chunking parser consists of two modules: a chunker and an attacher. The chunker divides the sentence into labeled, non-overlapping sequences (chunks) of words, with each chunk containing a head and (nearly) all of its premodifiers, exluding arguments and postmodifiers. His chunker works on the basis of POS information alone, whereas the second module, the attacher, also uses lexical information. Chunks as a separate level have also been used in Collins (1996) and Ratnaparkhi (1997).</Paragraph>
    <Paragraph position="2"> This brief overview shows that the main reason for the use of POS tags in parsing is that they provide Computational Linguistics (ACL), Philadelphia, July 2002, pp. 433-440. Proceedings of the 40th Annual Meeting of the Association for useful generalizations and (thereby) counteract the sparse data problem. However, there are two objections to this reasoning. First, as naturally occurring text does not come POS-tagged, we first need a module to assign POS. This tagger can base its decisions only on the information present in the sentence, i.e.</Paragraph>
    <Paragraph position="3"> on the words themselves. The question then arises whether we could use this information directly, and thus save the explicit tagging step. The second objection is that sparseness of data is tightly coupled to the amount of training material used. As training material is more abundant now than it was even a few years ago, and today's computers can handle these amounts, we might ask whether there is now enough data to overcome the sparseness problem for certain tasks.</Paragraph>
    <Paragraph position="4"> To answer these two questions, we designed the following experiments. The task to be learned is a shallow parsing task (described below). In one experiment, it has to be performed on the basis of the &amp;quot;gold-standard&amp;quot;, assumed-perfect POS taken directly from the training data, the Penn Treebank (Marcus et al., 1993), so as to abstract from a particular POS tagger and to provide an upper bound.</Paragraph>
    <Paragraph position="5"> In another experiment, parsing is done on the basis of the words alone. In a third, a special encoding of low-frequency words is used. Finally, words and POS are combined. In all experiments, we increase the amount of training data stepwise and record parse performance for each step. This yields four learning curves. The word-based shallow parser displays an apparently log-linear increase in performance, and surpasses the flatter POS-based curve at about 50,000 sentences of training data. The low-frequency variant performs even better, and the combinations is best. Comparative experiments with a real POS tagger produce lower results.</Paragraph>
    <Paragraph position="6"> The paper is structured as follows. In Section 2 we describe the parsing task, its input representation, how this data was extracted from the Penn Treebank, and how we set up the learning curve experiments using a memory-based learner. Section 3 provides the experimental learning curve results and analyses them. Section 4 contains a comparison of the effects with gold-standard and automatically assigned POS.</Paragraph>
    <Paragraph position="7"> We review related research in Section 5, and formulate our conclusions in Section 6.</Paragraph>
  </Section>
class="xml-element"></Paper>