<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-2004">
  <Title>Duke's Trainable Information and Meaning Extraction System</Title>
  <Section position="2" start_page="0" end_page="7" type="metho">
    <SectionTitle>
2 System Architecture
</SectionTitle>
    <Paragraph position="0"> As illustrated in Figure 1, the system runs in three main stages: the Training Process, Rule Generalization, and the Scanning Process. During the Training Process, the user, with the help of a graphical user interface, takes a few prototypical articles from the domain that the system is being trained on and creates rules (patterns) for the target information contained in those articles. These rules are specific to the training articles, so they must be generalized before they can be run on other articles from the domain. The Rule Generalization routines, with the help of WordNet (Miller, 1990; see footnote 1), generalize the specific rules generated by the Training Process. The system can then be run on a large number of articles from the domain (the Scanning Process). The output of the Scanning Process, for each article, is a semantic network for that article, which can then be used by a Postprocessor to fill templates, answer queries, or generate abstracts.</Paragraph>
    <Paragraph position="1"> Footnote 1: WordNet is an on-line lexical reference system developed by George Miller at Princeton University. Acknowledgment: Supported by Fellowships from IBM Corporation.</Paragraph>
    <Paragraph position="2"> [Figure 1; recoverable labels: WordNet, Rule Generalization Routines]</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Tools Used By the System
</SectionTitle>
      <Paragraph position="0"> In addition to WordNet, the system uses IBM's LanguageWare English Dictionary, IBM's Computing Terms Dictionary, and a local dictionary of our choice. The system also uses a gazetteer consisting of approximately 250 names of cities, states, and countries.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="7" type="sub_section">
      <SectionTitle>
2.2 The Tokenizer, the Preprocessor, and the Partial Parser
</SectionTitle>
      <Paragraph position="0"> The Tokenizer accepts ASCII characters as input and produces a stream of tokens (words) as output.</Paragraph>
      <Paragraph position="1"> It also determines sentence boundaries.</Paragraph>
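The Tokenizer's behavior described above can be illustrated with a minimal sketch. The paper does not give the actual tokenization rules, so the regular expression, the set of sentence-ending tokens, and the function name `tokenize` are all assumptions made for illustration; abbreviations such as "U.S." are deliberately not handled.

```python
import re

# Assumed token pattern (not from the paper): a run of letters/digits,
# optionally joined by an internal ".", "'" or "-", or else a single
# non-space symbol such as a punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

# Assumed sentence terminators; a real tokenizer would need abbreviation handling.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split ASCII text into sentences, each returned as a list of tokens."""
    sentences = []
    current = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        current.append(token)
        if token in SENTENCE_END:
            # A terminator closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not followed by a terminator.
        sentences.append(current)
    return sentences
```

Calling `tokenize("IBM hired a manager. She starts Monday!")` yields two token lists, one per sentence, with the terminating punctuation kept as the last token of each.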
      <Paragraph position="2"> The Preprocessor tries to identify important entities contained in the article, such as names of companies and other proper names.</Paragraph>
      <Paragraph position="3"> Groups of words that comprise these entities are collected together and considered as one item for all future processing.</Paragraph>
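One way to picture this grouping step is a longest-match lookup against a list of known names. This is only a sketch under stated assumptions: the paper does not say how the Preprocessor recognizes entities, and the function `group_entities`, the `known_entities` set, and the `max_len` limit are hypothetical names introduced here.

```python
def group_entities(tokens, known_entities, max_len=4):
    """Merge known multi-word names into single items, longest match first.

    `known_entities` is an assumed lookup set of space-joined names; the
    paper's actual entity recognition method is not described.
    """
    grouped = []
    i = 0
    n = len(tokens)
    while n - i > 0:
        # Try the longest candidate span first, shrinking down to one token.
        for span in range(min(max_len, n - i), 0, -1):
            candidate = " ".join(tokens[i:i + span])
            if span == 1 or candidate in known_entities:
                # A single token is always kept as-is; longer spans must match.
                grouped.append(candidate)
                i += span
                break
    return grouped
```

With `known_entities = {"IBM Corporation", "New York"}`, the tokens `["IBM", "Corporation", "opened", "an", "office", "in", "New", "York", "."]` become seven items, with "IBM Corporation" and "New York" each treated as one item by later stages.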
      <Paragraph position="4"> The Partial Parser produces a sequence of non-overlapping phrases as output. The headword of each phrase is also identified. The parser recognizes noun groups, verb groups, and preposition groups (Hobbs, 1993).</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
2.3 The Training Interface
</SectionTitle>
      <Paragraph position="0"> There are two parts to the Training Process: identifying the (WordNet) sense in which each headword of interest is used, and building specific rules. Training is done by a user with the help of a graphical Training Interface.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="7" end_page="7" type="metho">
    <SectionTitle>
3 Generalization
</SectionTitle>
    <Paragraph position="0"> Rules created as a result of the Training Process are very specific and can only be applied to exactly the same patterns as the ones present during the training. Generalization consists of replacing each concept in a rule by a more generalized concept (obtained from WordNet). Figure 2 shows the different degrees of generalization of the concept "IBM Cor-</Paragraph>
    <Paragraph position="2"/>
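The idea of replacing each concept in a rule by a more general one can be sketched as climbing hypernym links a fixed number of steps. Everything below is an assumption for illustration: the `HYPERNYM` table is a toy hierarchy invented here (it is not WordNet data and does not reproduce Figure 2), and representing a rule as a tuple of concepts is a simplification of whatever rule format the system actually uses.

```python
# Toy hypernym links (child -> parent), invented for illustration only.
# A real system would look these up in WordNet instead.
HYPERNYM = {
    "company": "institution",
    "institution": "organization",
    "organization": "group",
}


def generalize(concept, degree):
    """Climb up to `degree` hypernym links, stopping at the top of the toy hierarchy."""
    for _ in range(degree):
        parent = HYPERNYM.get(concept)
        if parent is None:
            # No more general concept is known; keep the current one.
            break
        concept = parent
    return concept


def generalize_rule(rule, degree):
    """Generalize every concept in a rule, modeled here as a tuple of concepts."""
    return tuple(generalize(concept, degree) for concept in rule)
```

Under these assumptions, `generalize("company", 2)` gives "organization", and a larger degree makes the rule match more articles at the cost of precision, which is the trade-off the degrees of generalization express.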
  </Section>
</Paper>