<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1015">
  <Title>The Domain Dependence of Parsing</Title>
  <Section position="4" start_page="0" end_page="96" type="intro">
    <SectionTitle>
2 Data and Tools
</SectionTitle>
    <Paragraph position="0"> The definition of domain strongly influences the outcome of our experiments, so it is very important to choose a proper corpus. However, for practical reasons (availability and time constraints), we decided to use an existing multi-domain corpus which has a naturally acceptable domain definition. In order to acquire grammar rules in our experiment, we need a syntactically tagged corpus consisting of different domains, and the tagging has to be uniform throughout the corpus. To meet these requirements, the Brown Corpus (Francis and Kucera, 1964) in the distribution of PennTreeBank version 1 (Marcus et al., 1995) is used in our experiments. The corpus consists of 15 domains as shown in Appendix A; in the rest of the paper, we use the letters from the list to represent the domains. Each sample contains about the same amount of text in terms of the number of words (2000 words), although a part of the data is discarded because of erroneous data format.</Paragraph>
    <Paragraph position="1"> For the parsing experiment, we use the 'Apple Pie Parser' (Sekine, 1995; Sekine, 1996). It is a probabilistic, bottom-up, best-first search chart parser, and its grammar can be obtained from a syntactically tagged corpus. We acquire two-nonterminal grammars from the corpus. Here, 'two-nonterminal grammar' means a grammar which uses only 'S' (sentence) and 'NP' (noun phrase) as actual non-terminals; other grammatical nodes, like 'VP' or 'PP', are embedded into a rule. In other words, every rule has either 'S' or 'NP' as its left-hand-side symbol. This strategy is useful for producing better accuracy compared to a grammar that uses all non-terminals. See (Sekine, 1995) for details.</Paragraph>
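The two-nonterminal idea described above can be illustrated with a minimal sketch. This is not the Apple Pie Parser's code: the tuple encoding of trees, the names KEPT, flatten and extract_rules, and the example tree are all assumptions made for illustration. The sketch also drops the internal structure of the embedded nodes and keeps only the flattened right-hand side of each rule.

```python
# Hypothetical sketch of two-nonterminal rule extraction.
# A tree is a tuple (label, child, child, ...); a leaf is a POS tag string.
# All names and the encoding are assumptions, not taken from the parser.

KEPT = {"S", "NP"}  # the only labels allowed on a rule's left-hand side


def flatten(children, rules):
    """Build the right-hand side for a sequence of child nodes.

    Kept nodes (S, NP) appear as a single symbol and get their own rule;
    every other node (VP, PP, ...) is replaced by its flattened children.
    """
    rhs = []
    for child in children:
        if isinstance(child, str):
            # Leaf: a part-of-speech tag goes straight into the rule.
            rhs.append(child)
            continue
        label, kids = child[0], child[1:]
        if label in KEPT:
            # Kept non-terminal: reference it and emit a rule for it.
            rhs.append(label)
            inner = flatten(kids, rules)
            rules.append((label, tuple(inner)))
        else:
            # Any other node is embedded into the enclosing rule.
            rhs.extend(flatten(kids, rules))
    return rhs


def extract_rules(tree):
    """Return a list of (lhs, rhs) rules for a tree rooted in S or NP."""
    rules = []
    label, kids = tree[0], tree[1:]
    rhs = flatten(kids, rules)
    rules.append((label, tuple(rhs)))
    return rules


if __name__ == "__main__":
    example = ("S",
               ("NP", "DT", "NN"),
               ("VP", "VBD",
                ("NP", "DT", "NN"),
                ("PP", "IN", ("NP", "NN"))))
    for lhs, rhs in extract_rules(example):
        print(lhs, "::", " ".join(rhs))
```

In this example the VP and PP nodes never appear as rule symbols; their tags and noun phrases are folded into the single S rule, which is the sense in which they are embedded.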
    <Paragraph position="2"> In this experiment, grammars are acquired from the corpus of a single domain, or from some combination of domains. In order to avoid the unknown word problem, we use a general dictionary to supplement the dictionary acquired from the corpus. Then, we apply each of the grammars to texts of different domains. We use only 8 domains (A, B, E, J, K, L, N and P) for this experiment, because we want to fix the corpus size for each domain, and we want to have the same number of non-fiction and fiction domains. The main objective is to compare the parsing performance of a grammar acquired from the same domain as the test text with the performance of grammars acquired from different domains or from combined domains. The effect of training corpus size will also be discussed.</Paragraph>
  </Section>
</Paper>