<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1024"> <Title>BOMBINGACCOMPLISHED 'THE BOMB &quot;&quot;EXPLOSIVES &quot; BOMB: 'THE BOMB &quot;EXPLOSIVE: &quot;EXPLOSIVES&quot; TERRORIST ACT&quot;GUERRILLAS&quot; &quot;FMLN &quot; REPORTED AS FACT: &quot;FMLN&quot;&quot;MERINO'S HOME&quot;</Title> <Section position="3" start_page="0" end_page="169" type="intro"> <SectionTitle> SYSTEM ARCHITECTURE </SectionTitle> <Paragraph position="0"> The PLUM architecture is presented in Figure 1.</Paragraph> <Paragraph position="1"> Preprocessing: The input to the system is a file containing one or more messages. The preprocessing module determines message boundaries, identifies the header, and determines paragraph and sentence boundaries. The specification of the input format is now a declarative component of the preprocessor, which enables us to easily digest messages in different formats. This component has proved its utility in porting to two non-MUC formats in the last year.</Paragraph> <Section position="1" start_page="0" end_page="169" type="sub_section"> <SectionTitle> Morphological Analysis </SectionTitle> <Paragraph position="0"> The first phase of the processing is assignment of part-of-speech information. In BBN's Fast Partial Parser (FPP) [2], a bi-gram probability model, frequency models for known words (derived from large corpora), and heuristics based on word endings for unknown words assign part of speech to the highly ambiguous words of the corpus. Since these predictions for unknown words were very inaccurate for input that is all upper case, we augmented this part-of-speech tagging with automatically trained probabilistic models for recognizing words of Spanish origin and words of English origin. This allowed us to tag, with high reliability, new words that were actually Latin American names. The Spanish classifier uses a 5-character hidden Markov model, trained on about 30,000 words of Spanish text.
The five-gram model of English words was derived from text from the Wall Street</Paragraph> </Section> <Section position="2" start_page="169" end_page="169" type="sub_section"> <SectionTitle> Parsing </SectionTitle> <Paragraph position="0"> The FPP is a deterministic stochastic parser that does not attempt to generate a single syntactic interpretation of the whole sentence; rather, it generates one or more non-overlapping parse fragments spanning the input sentence, deferring difficult decisions on attachment ambiguities. FPP produces an average of seven fragments for sentences of the complexity seen in the MUC-4 corpus.</Paragraph> <Paragraph position="1"> Here are the 8 parse fragments generated by FPP for the first sentence of TST2-MUC4-0048:</Paragraph> </Section> </Section> </Paper>