<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1028">
  <Title>Beyond N-Grams: Can Linguistic Sophistication Improve Language Modeling?</Title>
  <Section position="4" start_page="187" end_page="188" type="metho">
    <SectionTitle>
2 Net Human Improvement
</SectionTitle>
    <Paragraph position="0"> The first question to ask is whether people are able to improve upon the speech recognizer's output by postprocessing the n-best lists. For 3 Note that what we are really measuring is an upper bound on improvement under the paradigm of n-best postprocessing. This is a common technique in speech recognition, but it results in the postprocessor not having access to the entire set of hypotheses, or to full acoustic information.</Paragraph>
    <Paragraph position="1"> 4 HTK software was used to build all recognizers.</Paragraph>
    <Paragraph position="2"> s This program is available at http:llwww.cs.jhu.edullabslnlp each corpus, we have four measures: (1) the recognizer's word error rate, (2) the oracle error rate, (3) human error rate when choosing among the 10-best (human selection) and (4) human error rate when allowed to posit any word sequence (human edit).</Paragraph>
    <Paragraph position="3"> The oracle error rate is the upper bound on how well anybody could do when restricted to choosing between the 10 best hypotheses: the oracle always chooses the string with the lowest word error rate. Note that if the human always picked the highest-ranking hypothesis, then her accuracy would be equivalent to that of the recognizer. Below we show the results for each corpus, averaged across the subjects:  In the following table, we show the results as a function of what percentage of the difference between recognizer and oracle the humans are able to attain. In other words, when the human is not restricted to the 10-best list, he is able to advance 75.5% of the way between recognizer and oracle word error rate on the</Paragraph>
    <Section position="1" start_page="187" end_page="188" type="sub_section">
      <SectionTitle>
Percent of Difference Attained Between Recognizer and Oracle
</SectionTitle>
      <Paragraph position="0"> There are a number of interesting things to note about these results. First, they are quite encouraging, in that people are able to improve the output on all corpora. As the accuracy of the recognizer improves, the relative human improvement increases. While people can attain over three-quarters of the possible word error rate reduction over the recognizer on Wall Street Journal, they are only able to attain 25.9% of the possible reduction in Switchboard. This is probably attributable to two causes. The more  varied the language is in the corpus, the harder it is for a person to predict what was said. Also, the higher the recognizer word error rate, the less reliable the contextual cues will be which the human uses to choose a lower error rate string. In Switchboard, over 40% of the words in the highest ranked hypothesis are wrong.</Paragraph>
      <Paragraph position="1"> Therefore, the human is basing her judgement on much less reliable contexts in Switchboard than in the much lower word error rate Wall Street Journal, resulting in less net improvement. For all three corpora, allowing the person to edit the output, as opposed to being limited to pick one of the ten highest ranked hypotheses, resulted in significant gains: over 50% for Switchboard and Broadcast News, and 30% for Wall Street Journal. This indicates that within the paradigm of n-best list postprocessing, one should strongly consider methods for editing, rather than simply choosing.</Paragraph>
      <Paragraph position="2"> In examining the relative gain over the recognizer the human was able to achieve as a function of sentence length, for the three different corpora, we observed that the general trend is that the longer the sentence is, the greater the net gain is. This is because a longer sentence provides more cues, both syntactic and semantic, that can be used in choosing the highest quality word sequence. We also observed that, other than the case of very low oracle error rate, the more difficult the task is the lower the net human gain. So both across corpora and corpus-internal, we find this relationship between quality of recognizer output and ability of a human to improve upon recognizer output.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="188" end_page="188" type="metho">
    <SectionTitle>
3 Usefulness of Linguistic Information
</SectionTitle>
    <Paragraph position="0"> In discussions with the participants after they ran the experiment, it was determined that all participants essentially used the same strategy. When all hypotheses appeared to be equally bad, the highest-ranking hypothesis was chosen. This is a conservative strategy that will ensure that the person does no worse than the recognizer on these difficult cases. In other cases, people tried to use linguistic knowledge to pick a hypothesis they felt was better than the highest ranked hypothesis.</Paragraph>
    <Paragraph position="1"> In Figure 2, we show the distribution of proficiencies that were used by the subjects. We show for each of the three corpora, the percentage of 10-best instances for which the person used each type of knowledge (along with the ranking of these percentages), as well as the net gain over the recognizer accuracy that people were able to achieve by using this information source. For all three corpora, the most common (and most useful) proficiency was that of closed class word choice, for example confusing the words in and and, or confusing than and that. It is encouraging that although world knowledge was used frequently, there were many linguistic proficiencies that the person used as well. If only world knowledge accounted for the person's ability to improve upon the recognizer's output, then we might be faced with an AI-complete problem: speech recognizer improvements are possible, but we would have to essentially solve AI before the benefit could be realized.</Paragraph>
    <Paragraph position="2"> One might conclude that although people were able to make significant improvements over the recognizer, we may still have to solve linguistics before these improvements could actually be realized by any actual computer system. However, we are encouraged that algorithms could be created that can do quite well at mimicking a number of proficiencies that contributed to the human's performance improvement. For instance, determiner choice was a factor in roughly 25% of the examples for the Wall Street Journal.</Paragraph>
    <Paragraph position="3"> There already exist algorithms for choosing the proper determiner with fairly high accuracy (Knight(1994)). Many of the cases involved confusion between a relatively small set of choices: closed class word choice, determiner choice, and preposition choice. Methods already exist for choosing the proper word from a fixed set of possibilities based upon the context in which the word appears (e.g. Golding(1996)).</Paragraph>
    <Paragraph position="4"> Conclusion In this paper, we have shown that humans, by postprocessing speech recognizer output, can make significant improvements in accuracy over the recognizer. The improvements increase with the recognizer's accuracy, both within a particular corpus and across corpora. This demonstrates that there is still a great deal to gain without changing the recognizer's internal models, and simply operating on the recognizer's output. This is encouraging news, as it is typically a much simpler matter to do postprocessing than to attempt to integrate a knowledge source into the recognizer itself.</Paragraph>
    <Paragraph position="5"> We have presented a description of the proficiencies people used to make these improvements and how much each contributed to the person's success in improving over the recognizer accuracy. Many of the gains involved linguistic proficiencies that appear to be solvable (to a degree) using methods that have been recently developed in natural language processing. We hope that by honing in on the specific high-yield proficiencies that are amenable to being solved using current technology, we will finally advance beyond ngrams. null There are four primary foci of future work. First, we want to expand our study to include more people. Second, now that we have some picture as to the proficiencies used, we would like to do a more refined study at a lower level of granularity by expanding the repertoire of proficiencies the person can choose from in describing her decision process. Third, we want to move from what to how: we now have some idea what proficiencies were used and we would next like to establish to the extent we can how the human used them. Finally, eventually we can only prove the validity of our claims by actually using what we have learned to improve speech recognition, which is our ultimate goal.</Paragraph>
  </Section>
class="xml-element"></Paper>