<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1024">
  <Title>BOMBINGACCOMPLISHED 'THE BOMB &amp;quot;&amp;quot;EXPLOSIVES &amp;quot; BOMB: 'THE BOMB &amp;quot;EXPLOSIVE: &amp;quot;EXPLOSIVES&amp;quot; TERRORIST ACT&amp;quot;GUERRILLAS&amp;quot; &amp;quot;FMLN &amp;quot; REPORTED AS FACT: &amp;quot;FMLN&amp;quot;&amp;quot;MERINO'S HOME&amp;quot;</Title>
  <Section position="3" start_page="0" end_page="169" type="intro">
    <SectionTitle>
SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> The PLUM architecture is presented in Figure 1.</Paragraph>
    <Paragraph position="1"> Preprocessing: The input to the system is a file containing one or more messages. The preprocessing module determines message boundaries, identifies the header, and determines paragraph and sentence boundaries. The specification of the input format is now a declarative component of the preprocessor, which enables us to easily digest messages in different formats. This component has proved its utility in porting to two non-MUC formats in the last year.</Paragraph>
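    <Paragraph position="2"><![CDATA[The idea of a declarative input-format specification can be illustrated with a minimal sketch. It is not the PLUM preprocessor: the class MessageFormat, the function preprocess, and every regular expression below are hypothetical, and the sketch simply treats the line that opens a message as its header.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageFormat:
    """Declarative description of one input format (all fields hypothetical)."""
    message_start: str    # regex matching the first line of each message
    paragraph_break: str  # regex separating paragraphs within a message body
    sentence_end: str     # regex matching the gap between two sentences


# Illustrative spec only; real MUC-4 messages may need different patterns.
MUC_LIKE = MessageFormat(
    message_start=r"^TST\d-MUC\d-\d{4}$",
    paragraph_break=r"\n\s*\n",
    sentence_end=r"(?<=[.?!])\s+",
)


def preprocess(text, fmt):
    """Split raw text into messages with a header, paragraphs, and sentences."""
    start = re.compile(fmt.message_start, re.MULTILINE)
    matches = list(start.finditer(text))
    messages = []
    for i, m in enumerate(matches):
        # A message body runs up to the next message start (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        paragraphs = [
            # Collapse line breaks inside a paragraph, then split sentences.
            [s for s in re.split(fmt.sentence_end, " ".join(p.split())) if s]
            for p in re.split(fmt.paragraph_break, body)
            if p.strip()
        ]
        messages.append({"header": m.group(0), "paragraphs": paragraphs})
    return messages
```

Under this assumption, supporting a different message format means supplying another MessageFormat value; the preprocess function itself does not change.]]></Paragraph>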
    <Section position="1" start_page="0" end_page="169" type="sub_section">
      <SectionTitle>
Morphological Analysis
</SectionTitle>
      <Paragraph position="0"> The first phase of the processing is assignment of part-of-speech information. In BBN's Fast Partial Parser (FPP) [2], a bi-gram probability model, frequency models for known words (derived from large corpora), and heuristics based on word endings for unknown words assign part of speech to the highly ambiguous words of the corpus. Since these predictions for unknown words were very inaccurate for input that is all upper case, we augmented this part-of-speech tagging with automatically trained probabilistic models for recognizing words of Spanish origin and words of English origin. This allowed us to tag new words that were actually Latin American names highly reliably. The Spanish classifier uses a 5-character hidden Markov model, trained on about 30,000 words of Spanish text. The five-gram model of words of English was derived from text from the Wall Street</Paragraph>
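      <Paragraph position="1"><![CDATA[As a rough illustration of classifying an unknown word by origin, the sketch below compares two character n-gram models. It is a simplification under stated assumptions, not the classifier described above: it swaps the hidden Markov model for a plain character n-gram model with add-one smoothing, and the names CharNgramModel and classify, the alphabet size, and the toy word lists are all invented for the example.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, n=5, alphabet_size=27):
        self.n = n
        self.alphabet_size = alphabet_size  # assumed: A-Z plus an end marker
        self.ngrams = Counter()
        self.contexts = Counter()

    def _grams(self, word):
        # Pad so that the first and last characters also get full contexts.
        padded = "^" * (self.n - 1) + word.upper() + "$"
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def train(self, words):
        for word in words:
            for gram in self._grams(word):
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, word):
        # Sum of smoothed log P(character | preceding n-1 characters).
        return sum(
            math.log((self.ngrams[g] + 1)
                     / (self.contexts[g[:-1]] + self.alphabet_size))
            for g in self._grams(word)
        )


def classify(word, models):
    """Return the label of the model that scores the word highest."""
    return max(models, key=lambda label: models[label].log_prob(word))


# Toy training lists, invented for illustration; far smaller than the
# roughly 30,000 words of Spanish text mentioned in the paragraph above.
SPANISH = ["MARTINEZ", "RODRIGUEZ", "HERNANDEZ", "GONZALEZ", "SANCHEZ", "RAMIREZ"]
ENGLISH = ["WASHINGTON", "JOHNSON", "WILLIAMS", "SMITH", "BROWN", "TAYLOR"]

models = {"spanish": CharNgramModel(), "english": CharNgramModel()}
models["spanish"].train(SPANISH)
models["english"].train(ENGLISH)
```

With these toy lists, classify("FERNANDEZ", models) selects the Spanish model, because the word shares several five-character sequences with HERNANDEZ. A tagger could use such a label to prefer a proper-name tag for an unknown upper-case word.]]></Paragraph>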
    </Section>
    <Section position="2" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
Parsing
</SectionTitle>
      <Paragraph position="0"> The FPP is a deterministic stochastic parser that does not attempt to generate a single syntactic interpretation of the whole sentence; rather, it generates one or more non-overlapping parse fragments spanning the input sentence, deferring difficult decisions on attachment ambiguities. FPP produces an average of seven fragments for sentences of the complexity seen in the MUC-4 corpus.</Paragraph>
      <Paragraph position="1"> Here are the 8 parse fragments generated by FPP for the first sentence of TST2-MUC4-0048:</Paragraph>
    </Section>
  </Section>
</Paper>