<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1031">
  <Title>Named Entity Scoring for Speech Input</Title>
  <Section position="2" start_page="0" end_page="201" type="intro">
    <SectionTitle>
2. Scoring Procedure
</SectionTitle>
    <Paragraph position="0"> The scoring algorithm proceeds in five stages:
      1. Preprocessing to prepare data for alignment
      2. Alignment of lexemes in the reference and hypothesis files
      3. Named entity mapping to determine corresponding phrases in the reference and hypothesis files
      4. Comparison of the mapped entities in terms of tag type, tag extent and tag content
      5. Final computation of the score
      Footnote 1: MUC &quot;named entities&quot; include person, organization and location names, as well as numeric expressions.
      Footnote 2: Indeed, the Tipster scoring and annotation algorithms require, as part of the Tipster architecture, that the annotation preserve the underlying text including white space. The MUC named entity scoring algorithm uses character offsets to compare the mark-up of two texts.</Paragraph>
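    <Paragraph><![CDATA[ Stage 2, the alignment of lexemes, can be illustrated with a generic word-level alignment. The sketch below is not the paper's alignment procedure, which also relies on recognizer timestamps; it substitutes Python's standard difflib.SequenceMatcher as a stand-in, and the function name align_lexemes is hypothetical. The two word sequences are the ref/hyp example pair quoted in Section 2.1.

```python
from difflib import SequenceMatcher


def align_lexemes(ref_words, hyp_words):
    """Return (tag, ref_span, hyp_span) triples for two word lists.

    Assumption: a plain sequence match stands in for the paper's
    timestamp-assisted alignment, which is not reproduced here.
    """
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Example pair taken from the paper's own ref/hyp illustration.
ref = "AT THE NEW YORK DESK I'M PHILIP BOROFF".split()
hyp = "AT THE NEWARK BASK ON FILM FORUM MISSES".split()

alignment = align_lexemes(ref, hyp)
for tag, ref_span, hyp_span in alignment:
    print(tag, ref_span, hyp_span)
```

On this pair only AT THE aligns as equal; the remaining six reference words fall into a single replace block against six hypothesis words, which is the kind of mismatch the later mapping and comparison stages have to cope with. ]]></Paragraph>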
    <Section position="1" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
2.1 Stage 1: Preprocessing
</SectionTitle>
      <Paragraph position="0"> The algorithm takes three files as input: the human-transcribed reference file with key NE phrases; the speech recognizer output, which includes coarse-grained timestamps used in the alignment process; and the recognizer output tagged with NE mark-up.</Paragraph>
      <Paragraph position="1"> The first phase of the scoring algorithm reformats these input files to allow direct comparison of the raw text. This is necessary because the transcript file and the output of the speech recognizer may contain information in addition to the lexemes. For example, for the Broadcast News corpus provided by the Linguistic Data Consortium, the transcript file contains, in addition to mixed-case text representing the words spoken, extensive SGML and pseudo-SGML annotation including segment timestamps, speaker identification, background noise and music conditions, and comments. In the preprocessing phase, this
        ref: AT THE NEW YORK DESK I'M PHILIP BOROFF
        hyp: AT THE NEWARK BASK ON FILM FORUM MISSES</Paragraph>
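      <Paragraph><![CDATA[ The reformatting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name to_lexemes, the regular expressions, and the tag and attribute names in the sample fragment are all hypothetical. It assumes that annotation appears as angle-bracketed tags and that comparison is done on upper-cased words.

```python
import re

# Assumption: all annotation to be dropped is angle-bracketed, SGML-style.
SGML_TAG = re.compile(r"<[^>]+>")
# Assumption: lexemes consist of letters, digits and apostrophes only.
NON_LEXEME = re.compile(r"[^A-Z0-9' ]+")


def to_lexemes(text):
    """Strip SGML-style annotation and return upper-cased word tokens."""
    text = SGML_TAG.sub(" ", text)                # drop tags, keep spoken words
    text = NON_LEXEME.sub(" ", text.upper())      # fold case, drop punctuation
    return text.split()


# Hypothetical transcript fragment; the tag and attribute names are invented.
transcript = (
    '<segment s_time="0.0" speaker="spkr1"> '
    "At the New York desk, I'm Philip Boroff. </segment>"
)

lexemes = to_lexemes(transcript)
print(" ".join(lexemes))
```

Applied to the invented fragment, the function yields the same upper-cased word string as the ref line in the example, so the reference and hypothesis can be compared word by word. ]]></Paragraph>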
    </Section>
  </Section>
</Paper>