<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1807">
  <Title>Merging Stories with Shallow Semantics</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Extracting Clauses from Text
</SectionTitle>
    <Paragraph position="0"> The method we have adopted for extracting first-order clauses from text can be called 'semantic chunking.' This seems an appropriate term for two reasons. First, we use a syntactic chunker to identify noun groups and verb groups (i.e. non-recursive clusters of related words with a noun or verb head respectively). Second, we use a cascade of finite state rules to map from this shallow syntactic structure into first-order clauses; this cascade is conceptually very similar to the chunking method pioneered by Abney's Cass chunker (Abney, 1996).</Paragraph>
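    <Paragraph> To make the two stages concrete, here is a toy illustration (our own sketch, not the authors' lxtransduce grammar) of a finite-state cascade in the style of Abney's Cass: a first pass rewrites part-of-speech tags into chunk labels, and a second pass rewrites the chunk sequence into a clause template. The tag string and the templates are invented for the example.

```python
import re

# POS tags for a toy sentence such as "the dog chased the cat"
tags = "DT NN VBD DT NN"

# Stage 1: group POS runs into noun groups (NG) and verb groups (VG).
chunked = re.sub(r"(DT )?NN", "NG", tags)   # optional determiner + noun -> NG
chunked = re.sub(r"VB\w*", "VG", chunked)   # any verb tag -> VG
print(chunked)   # NG VG NG

# Stage 2: map the chunk sequence onto a first-order clause template.
clause = re.sub(r"NG VG NG", "rel(arg1, arg2)", chunked)
print(clause)    # rel(arg1, arg2)
```

Each pass is a transduction over the output of the previous one, which is the sense in which the method is a cascade.
</Paragraph>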
    <Paragraph position="1"> The text processing framework we have used draws heavily on a suite of XML tools developed for generic XML manipulation (LTXML (Thompson et al., 1997)) as well as NLP-specific XML tools (LT-TTT (Grover et al., 2000), LT-CHUNK (Finch and Mikheev, 1997)). More recently, significantly improved upgrades of these tools have been developed, most notably the program lxtransduce, which performs rule-based transductions of XML structures. We have used lxtransduce both for the syntactic chunking (based on rules developed by Grover) and for the construction of semantic clauses.</Paragraph>
    <Paragraph position="2"> The main steps in the processing pipeline are as follows: 1. The text is tokenized into words and sentences.</Paragraph>
    <Paragraph position="3"> 2. The words are tagged for their part of speech using the CandC tagger (Clark and Curran, 2004) and the Penn Treebank tagset.</Paragraph>
    <Paragraph position="4"> 3. Pronoun resolution is carried out using the Glencova Pronoun Resolution algorithm (Halpin et al., 2004), which applies a series of rules similar to those of the CogNIAC engine (Baldwin, 1997), but omits the rules based on gender information, since the Penn Treebank tagset does not provide it.</Paragraph>
    <Paragraph position="5"> 4. The words are then reduced to their morphological stem (lemma) using Morpha (Minnen et al., 2001).</Paragraph>
    <Paragraph position="6"> 5. The lxtransduce program is used to chunk the sentence into verb groups and noun groups.</Paragraph>
    <Paragraph position="7"> 6. In an optional step, words are tagged as Named Entities, using the CandC tagger trained on MUC data.</Paragraph>
    <Paragraph position="8"> 7. The partially chunked sentences are selectively mapped into semantic clauses in a series of steps, described in more detail below.
8. The XML representation of the clauses is converted using an XSLT stylesheet into a more conventional syntactic format for use by Prolog or other logic-based systems.</Paragraph>
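    <Paragraph> The ordering of these steps can be sketched as follows. The stage functions are stand-in stubs, since the real steps invoke external tools (the CandC tagger, Morpha, lxtransduce, an XSLT processor); only the fixed order and the optional named-entity step (6) are taken from the description above.

```python
def stub(name):
    """Return a stage that just records its name on the document."""
    return lambda doc: {**doc, "applied": doc["applied"] + [name]}

STAGES = [stub(n) for n in (
    "tokenize",          # 1. words and sentences
    "pos_tag",           # 2. CandC tagger, Penn Treebank tagset
    "resolve_pronouns",  # 3. Glencova pronoun resolution
    "lemmatize",         # 4. Morpha
    "chunk",             # 5. lxtransduce noun/verb groups
    "tag_entities",      # 6. optional NE tagging (CandC on MUC data)
    "build_clauses",     # 7. semantic clause construction
    "to_prolog",         # 8. XSLT to a Prolog-style format
)]

def run_pipeline(text, with_entities=False):
    doc = {"text": text, "applied": []}
    for i, stage in enumerate(STAGES, start=1):
        if i == 6 and not with_entities:
            continue  # step 6 is optional
        doc = stage(doc)
    return doc

print(run_pipeline("John opened the door.")["applied"])
```
</Paragraph>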
    <Paragraph position="9"> The output of the syntactic processing is an XML file containing word elements which are heavily annotated with attributes. Following CoNLL BIO notation (Tjong et al., 2000), chunk information is recorded at the word level. Heads of noun groups and verb groups are assigned semantic tags such as arg and rel respectively. In addition, other semantically relevant forms such as conjunction, negation, and prepositions are also tagged. Most other input and syntactic information is discarded at this stage. However, we maintain a record through shared indices of which terms belong to the same chunks. This is used, for instance, to build coordinated arguments.</Paragraph>
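    <Paragraph> The word-level annotation can be illustrated with the following sketch (our own, hypothetical rendering of the representation described above, with plain dicts standing in for the XML word elements). Taking the last word of a group as its head is a simplifying assumption for the example.

```python
words  = ["John", "opened", "the", "door"]
groups = [("NG", 0, 1), ("VG", 1, 2), ("NG", 2, 4)]  # (type, start, end)

def bio_annotate(words, groups):
    """BIO chunk tags at the word level; heads of noun groups get the
    semantic tag 'arg', heads of verb groups 'rel'; a shared index
    records which words belong to the same chunk."""
    anno = [{"word": w, "chunk": "O", "sem": None, "idx": None} for w in words]
    for idx, (kind, start, end) in enumerate(groups):
        for i in range(start, end):
            anno[i]["chunk"] = ("B-" if i == start else "I-") + kind
            anno[i]["idx"] = idx          # shared chunk-membership index
        head = anno[end - 1]              # head = last word of the group
        head["sem"] = "arg" if kind == "NG" else "rel"
    return anno

for a in bio_annotate(words, groups):
    print(a)
```

The shared `idx` field is what makes it possible to recover whole chunks later, for instance when building coordinated arguments.
</Paragraph>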
    <Paragraph position="10"> Regular expressions over the semantically tagged elements are used to compose clauses, using the heuristic that an arg immediately preceding a pred is the subject of the clause, while args following the pred are complements. Since the heads of verb groups are annotated for voice, we can treat passive clauses appropriately, yielding a representation that is equivalent to the active congener. We also implement simple heuristics that capture many cases of control and verb phrase ellipsis.</Paragraph>
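    <Paragraph> A minimal sketch of this heuristic (our own illustration: the authors apply regular expressions via lxtransduce, whereas this version scans a list of hypothetical (head, tag, voice) triples):

```python
def compose_clause(tagged):
    """tagged: [(head, sem, voice)] in sentence order, with sem in
    {'arg', 'pred'}; voice is 'act' or 'pass' on the pred, None on args."""
    heads = [h for h, s, v in tagged]
    sems  = [s for h, s, v in tagged]
    p = sems.index("pred")
    # arg immediately before the pred -> subject; args after -> complements
    subject = heads[p - 1] if p > 0 and sems[p - 1] == "arg" else None
    objects = heads[p + 1:]
    if tagged[p][2] == "pass":
        # Passive: the surface subject is the underlying object; a
        # following (by-)arg, if present, supplies the underlying subject.
        subject, objects = (objects[0] if objects else None), [subject]
    args = [a for a in (subject, *objects) if a]
    return f"{heads[p]}({', '.join(args)})"

# Active and passive variants yield the same clause:
print(compose_clause([("John", "arg", None), ("opened", "pred", "act"),
                      ("door", "arg", None)]))   # opened(John, door)
print(compose_clause([("door", "arg", None), ("opened", "pred", "pass"),
                      ("John", "arg", None)]))   # opened(John, door)
```
</Paragraph>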
  </Section>
</Paper>