<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0639">
  <Title>The Integration of Syntactic Parsing and Semantic Role Labeling</Title>
  <Section position="4" start_page="238" end_page="238" type="metho">
    <SectionTitle>
4 Parsing Experiments
</SectionTitle>
    <Paragraph position="0"> We trained a maximum-entropy parser based on (Ratnaparkhi, 1999) using the OpenNLP package. We started our experiments with this parsing implementation because its flexibility allows us to test different features. In addition, the parser has four distinct parse-tree building stages: TAG, CHUNK, BUILD, and CHECK.</Paragraph>
    <Paragraph position="1"> This staged structure gives us an isolated working environment for each stage, which helps us confine the necessary implementation changes and track down implementation errors.</Paragraph>
    <Section position="1" start_page="238" end_page="238" type="sub_section">
      <SectionTitle>
4.1 Data Preparation
</SectionTitle>
      <Paragraph position="0"> Following standard practice, we use Sections 02-21 of the Penn Treebank and the PropBank as our training corpus. The constituent labels defined in the Penn Treebank consist of a primary label and several secondary labels. A primary label represents the major syntactic function carried by the constituent; for instance, NP indicates a noun phrase and PP indicates a prepositional phrase. A secondary label, starting with &quot;-&quot;, represents either a grammatical function of a constituent or a semantic function of an adjunct.</Paragraph>
      <Paragraph position="1"> For example, NP-SBJ means the noun phrase is a surface subject of the sentence; PP-LOC means the prepositional phrase is a location. Although the secondary labels give us much useful information, we stripped off all the secondary labels from the Penn Treebank because of the data sparseness problem and for training efficiency.</Paragraph>
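      <Paragraph> The label stripping described above can be sketched roughly as follows. This is only an illustrative sketch, not the preprocessing code used in the experiments: the function name strip_secondary_labels is hypothetical, and the special handling of tags that begin with a hyphen (such as -NONE-) is an assumption added so that those tags are not emptied.

```python
def strip_secondary_labels(label):
    """Return the primary label of a Penn Treebank constituent label.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Assumption: tags such as -NONE-, -LRB-, and -RRB- begin with "-" and
    # are not primary-plus-secondary combinations, so they are left untouched.
    if label.startswith("-"):
        return label
    # Everything before the first "-" is treated as the primary label.
    return label.split("-", 1)[0]


if __name__ == "__main__":
    for example in ("NP-SBJ", "PP-LOC", "NP", "-NONE-"):
        print(example, "becomes", strip_secondary_labels(example))
```

Under these assumptions, NP-SBJ and PP-LOC reduce to NP and PP, while a bare NP is returned unchanged.</Paragraph>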
      <Paragraph position="2"> After stripping off the secondary labels from the Penn Treebank, we augmented the constituent labels with the semantic argument information from the PropBank. We adopted four different labels: -AN, -ANC, -AM, and -AMC. If a constituent in the Penn Treebank is a core argument, which means the constituent has one of the labels ARG0-5 or ARGA in the PropBank, we attach -AN to the constituent label. The label -ANC means the constituent is a discontinuous core argument. Similarly, -AM indicates an adjunct-like argument, ARGM, and -AMC indicates a discontinuous ARGM.</Paragraph>
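      <Paragraph> One way to picture the four-label scheme is as a small mapping from PropBank labels to suffixes. The sketch below is an assumption-laden illustration rather than the authors' implementation: the names semantic_suffix and augment are hypothetical, and representing discontinuity as a boolean flag is an assumption, since the text does not say how discontinuous arguments are identified in the data.

```python
import re

# Core arguments as named in the text: ARG0 through ARG5, plus ARGA.
CORE_ARGUMENT = re.compile(r"^ARG[0-5A]$")


def semantic_suffix(propbank_label, discontinuous=False):
    """Map a PropBank label to -AN, -ANC, -AM, -AMC, or an empty string.

    Hypothetical helper; the boolean `discontinuous` flag is an assumption.
    """
    if CORE_ARGUMENT.match(propbank_label):
        return "-ANC" if discontinuous else "-AN"
    # ARGM and its subtypes (e.g. ARGM-TMP, ARGM-LOC) are adjunct-like.
    if propbank_label.startswith("ARGM"):
        return "-AMC" if discontinuous else "-AM"
    # Anything else is not treated as a semantic argument here.
    return ""


def augment(constituent_label, propbank_label, discontinuous=False):
    """Attach the semantic suffix to a stripped constituent label."""
    return constituent_label + semantic_suffix(propbank_label, discontinuous)


if __name__ == "__main__":
    print(augment("NP", "ARG0"))
    print(augment("NP", "ARGM-TMP"))
    print(augment("PP", "ARGM-LOC", discontinuous=True))
```

Collapsing all ARGM subtypes into a single -AM suffix keeps the label set small, which is in line with the data-sparseness concern raised earlier in this section.</Paragraph>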
      <Paragraph position="3"> For example, the sentence from Sec 02, [ARG0 The luxury auto maker] [ARGM-TMP last year] sold [ARG1 1,214 cars] [ARGM-LOC in the U.S.], would appear in the following format in our training corpus: (S (NP-AN (DT The) (NN luxury) (NN</Paragraph>
    </Section>
    <Section position="2" start_page="238" end_page="238" type="sub_section">
      <SectionTitle>
4.2 The 2 Different Parsers
</SectionTitle>
      <Paragraph position="0"> Since the core arguments and the ARGMs in the PropBank loosely correspond to the complements and adjuncts in the linguistics literature, we are interested in investigating their individual effects on parsing performance. We trained two parsers. The AN-parser was trained on the Penn Treebank corpus augmented with two semantic argument labels, -AN and -ANC. The AM-parser was trained on the same corpus augmented with the labels -AM and -AMC.</Paragraph>
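      <Paragraph> As a rough illustration of how the two training corpora differ, the sketch below projects a fully augmented label onto the view seen by each parser. It assumes, hypothetically, that both corpora are derived from one corpus carrying all four suffixes; the text does not state how the corpora were actually produced, and the name project_label is invented for this example.

```python
AN_SUFFIXES = ("-ANC", "-AN")
AM_SUFFIXES = ("-AMC", "-AM")


def project_label(label, keep):
    """Keep a semantic suffix only if it is listed in `keep`; else strip it.

    Hypothetical helper for illustration; not taken from the paper.
    """
    for suffix in AN_SUFFIXES + AM_SUFFIXES:
        if label.endswith(suffix):
            # Retain the label for the matching parser, drop the suffix otherwise.
            return label if suffix in keep else label[: -len(suffix)]
    # Labels without a semantic suffix are passed through unchanged.
    return label


if __name__ == "__main__":
    for label in ("NP-AN", "NP-AM", "PP-AMC", "VP"):
        print(label,
              "AN view:", project_label(label, AN_SUFFIXES),
              "AM view:", project_label(label, AM_SUFFIXES))
```

Under this assumption, the AN-parser's corpus would contain NP-AN but a plain NP where the full corpus has NP-AM, and the reverse would hold for the AM-parser.</Paragraph>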
    </Section>
  </Section>
</Paper>