<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1031">
  <Title>Named Entity Scoring for Speech Input</Title>
  <Section position="3" start_page="201" end_page="201" type="metho">
    <SectionTitle>
MISSISSIPPI REPUBLICAN
MISSES THE REPUBLICAN
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
2.2 Stage 2: Lexeme Alignment
</SectionTitle>
      <Paragraph position="0"> A key component of the scoring process is the actual alignment of individual lexemes in the reference and hypothesis documents. This task is similar to the alignment that is used to evaluate word error rates of speech recognizers: we match lexemes in the hypothesis text with their corresponding lexemes in the reference text.</Paragraph>
      <Paragraph position="1"> The standard alignment algorithm used for word error evaluation is a component of the NIST SCLite scoring package used in the Broadcast News evaluations (Garofolo 97). For each lexeme, it provides four possible classifications of the alignment: correct, substitution, insertion, and deletion. This classification has been successful for evaluating word error. However, it restricts alignment to a one-to-one mapping between hypothesis and reference texts. It is very common for multiple lexemes in one text to correspond to a single lexeme in the other, in addition to multiple-to-multiple correspon-</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="201" end_page="202" type="metho">
    <SectionTitle>
MISSISSIPPI REPUBLICAN
THE REPUBLICAN
</SectionTitle>
    <Paragraph position="0"> ref: AT THE NEW YORK DESK I'M PHILIP BOROFF MISSISSIPPI REPUBLICAN hyp: At&amp;quot; THE N~-~/~U&lt; BASK ON FILM FORUM MISSES THE REPUBLICAN Example 2: SCLite alignment (top) vs. phonetic alignment (bottom)  annotation and all punctuation is removed, and all remaining text is converted to upper-case.</Paragraph>
    <Paragraph position="1"> Each word in the reference text is then assigned an estimated timestamp based on the explicit timestamp of the larger parent segmentJ Given the sequence of all the timestamped words in each file, a coarse segmentation and alignment is performed to assist the lexeme alignment in Stage 2. This is done by identifying sequences of three or more identical words in the reference and hypothesis transcriptions, transforming the long sequence into a set of shorter sequences, each with possible mismatches. Lexeme alignment is then performed on these short  of the recognizer ouput, but in general the average sequence is 20-30 words long after this coarse segmentation.</Paragraph>
    <Paragraph position="2"> dences. For example, compare New York and Newark in Example 1. Capturing these alignment possibilities is especially important in evaluating NE performance, since the alignment facilitates phrase mapping and comparison of tagged regions.</Paragraph>
    <Paragraph position="3"> In the current implementation of our scoring algorithm, the alignment is done using a phonetic alignment algorithm (Fisher 93). In direct comparison with the standard alignment algorithm in the SCLite package, we have found that the phonetic algorithm results in more intuitive results. This can be seen clearly in Example 2, which repeats the reference and hypothesis texts of the previous example. The top alignment is that produced by the SCLite algorithm; the bottom by the phonetic algorithm. Since this example contains several instances of potential named entities, it also illustrates the impact of different alignment algorithms (and alignment errors) on phrase mapping and comparison. We will compare the effect of the two algorithms on the NE score in Section 3.</Paragraph>
    <Paragraph position="4">  ref: INVESTING * : . * &amp;quot;-~ * ~,~ :,, \[ TRADING JNITH CUBA hyp: INVESTING IN TRAINING i WOULD ,. KEEP OFF &amp;quot; .A~.. LOT ! OF ref: ImrEsTING AND ,TmmIMG~ i WItH !': ~&amp;quot;/i * FROM OTTAWA rH~S, IS hyp: INVESTING:. IN TRAINING WOULD KEEP 0FF A i~'LOT .; OF WHAT THIS ~ IS Example 3: Imperfect alignments (SCLite top, phonetic bottom)  Even the phonetic algorithm makes alignment mistakes. This can be seen in Example 3, where, as before, SCLite's alignment is shown above that of the phonetic algorithm. Once again, we judge the latter to be a more intuituive alignment--nonetheless, OTTAWA would arguably align better with the three word sequence LOT OF WHAT. As we shall see, these potential misalignments are taken into account in the algorithm's mapping and comparison phases.</Paragraph>
    <Section position="1" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
2.3 Stage 3: Mapping
</SectionTitle>
      <Paragraph position="0"> The result of the previous phase is a series of alignments between the words in the reference text and those in a recognizer's hypothesis. In both of these texts there is named-entity (NE) markup. The next phase is to map the reference NEs to the hypothesis NEs. The result of this will be corresponding pairs of reference and hypothesis phrases, which will be compared for correctness in Stage 4.</Paragraph>
      <Paragraph position="1"> Currently, the scorer uses a simple, greedy mapping algorithm to find corresponding NE pairs. Potential mapped pmrs are those that overlap--that is, if some word(s) in a hypothesis NE have been aligned with some word(s) in a reference NE, the reference and hypothesis NEs may be mapped to one another. If more than one potential mapping is possible, this is currently resolved in simple left-to-right fashion: the first potential mapping pair is chosen. A more sophisticated algorithm, such as that used in the MUC scorer, will eventually be used that attempts to optimize the pairings, in order to give the best possible final score.</Paragraph>
      <Paragraph position="2"> In the general case, there will be reference NEs that do not map to any hypothesis NE, and vice versa. As we shall see below, the unmapped reference NEs are completely missing from the hypothesis, and thus will correspond to recall errors. Similarly, unmapped hypothesis NEs are completely spurious: they precision errors.</Paragraph>
    </Section>
    <Section position="2" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
2.4 Stage 4: Comparison
</SectionTitle>
      <Paragraph position="0"> Once the mapping phase reference-hypothesis NEs, compared for correctness.</Paragraph>
      <Paragraph position="1"> will be scored as has found pairs of these pa~rs are As indicated above, we compare along three independent components: type, extent and content. The first two components correspond to MUC scoring and preserve backward compatibility. Thus our</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="202" end_page="203" type="metho">
    <SectionTitle>
FROM OTTAWA THIS IS
WHAT THIS IS
</SectionTitle>
    <Paragraph position="0"> algorithm can be used to generate MUC-style NE scores, given two texts that differ only in annotation.</Paragraph>
    <Paragraph position="1"> Type is the simplest of the three components: A hypothesis type is correct only if it is the same as the corresponding reference typer. Thus, in Example 4, hypothesis 1 has an incorrect type, while hypothesis 2 is correct.</Paragraph>
    <Paragraph position="2"> Extent comparison makes further use of the information from the alignment phase. Strict extent comparison requires the first word of the hypothesis NE to align with the first word of the reference NE, and similarly for the last word.</Paragraph>
    <Paragraph position="3"> Thus, in Example 4, hypotheses 1 and 2 are correct in extent, while hypotheses 3 and 4 are not. Note that in hypotheses 2 and 4 the alignment phase has indicated a split between the single reference word GINGRICH and the two hypothesis words GOOD RICH (that is, there is a one- to two-word alignment). In contrast, hypothesis 3 shows the alignment produced by SCLite, which allows only one-to-one alignment.</Paragraph>
    <Paragraph position="4"> In this case, just as in Example 4, extent is judged to be incorrect, since the final words of the reference and hypothesis NEs do not align.</Paragraph>
    <Paragraph position="5"> This strict extent comparison can be weakened by adjusting an extent tolerance. This is defined as the degree to which the first and/or last word of the hypothesis need not align exactly with the corresponding word of the reference NE. For example, if the extent tolerance is 1, then hypotheses 3 and 4 would both be correct in the extent component. The main reason for a non-zero tolerance is to allow for possible discrepancies in the lexeme alignment process-thus the tolerance only comes into play if there are word errors adjacent to the boundary in question (either the beginning or end of the NE). Here, because both GOOD and RICH are errors, hypotheses 3, 4 and 6 are given the benefit of the doubt when the extent tolerance is 1. For  hypothesis 5, however, extent is judged to be incorrect, no matter what the extent tolerance is, due to the lack of word errors adjacent to the boundaries of the entity.</Paragraph>
    <Paragraph position="6"> Content is the score component closest to the standard measures of word error. Using the word alignment information from the earlier phase, a region of intersection between the reference and the hypothesis text is computed, and there must be no word errors in this region. That is, each hypothesis word must align with exactly one reference word, and the two must be identical. The intuition behind using the intersection or overlap region is that otherwise extent errors would be penalized twice. Thus in hypothesis6, even though NEWT is in the reference NE, the substitution error (NEW) does not count with respect to content comparison, because only the region containing GINGRICH is examined. Note that the extent tolerance described above is not used to determine the region of intersection.</Paragraph>
    <Paragraph position="7"> Table 1 shows the score results for each of these score components on all six of the hypotheses in Example 4. The extent component is shown for two different thresholds, 0 and 1 (the latter being the default setting in our implementation).</Paragraph>
    <Section position="1" start_page="203" end_page="203" type="sub_section">
      <SectionTitle>
2.5 Stage 5: Final Computation
</SectionTitle>
      <Paragraph position="0"> After the mapped pairs are compared along all three components, a final score is computed. We use precision and recall, in order to distinguish between errors of commission (spurious responses) and those of omission (missing responses). For a particular pair of reference and hypothesis NE compared in the previous phase, each component that is incorrect is a substitution error, counting against both recall and precision, because a required reference element was missing, and a spurious hypothesis element was present.</Paragraph>
      <Paragraph position="1"> Each of the reference NEs that was not mapped to a hypothesis NE in the mapping phase also contributes errors: one recall error for each score component missing from the hypothesis text.</Paragraph>
      <Paragraph position="2"> Similarly, an unmapped hypothesis NE is completely spurious, and thus contributes three precision errors: one for each of the score components. Finally, we combine the precision and recall scores into a balanced F-measure.</Paragraph>
      <Paragraph position="3"> This is a combination of precision and recall, such that F-- 2PR /(P + R). F-measure is a single metric, a convenient way to compare systems or texts along one dimension 7.</Paragraph>
      <Paragraph position="4"> 7Because F-measure combines recall and precision, it effectively counts substitution errors twice. Makhoul et al. (1998) have proposed an alternate slot error metric</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>