<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-3003">
  <Title>Limited-Domain Speech-to-Speech Translation between English and Pashto</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overall Architecture
</SectionTitle>
    <Paragraph position="0"> A simple description of the architecture is as follows.</Paragraph>
    <Paragraph position="1"> The system is controlled by the English speaker, who is expected to have greater technological familiarity and who has the benefit of visual feedback on the system performance. Spoken input, in either English or Pashto, is recognized by SRI's small-footprint DynaSpeak(r) recognizer, and an ordered list of hypotheses is produced. The most likely hypothesis is input to SRI's Gemini natural language parser/generator (Dowding et al. 1993), which attempts to parse the speech recognition output. Handling of possible errors or failures will be discussed in Section 3.</Paragraph>
    <Paragraph position="2"> When a successful parse is obtained, Gemini creates a quasi-logical form representing the meaning of the sentence. In general, multiple quasi-logical forms may be created, reflecting differing interpretations of the input sentence. These forms, which are domain independent and serve here as an interlingua, can be ordered by heuristically assigning preferences or dispreferences to the parsing rules applied to create them. Gemini uses a grammar for the target language to generate a translation from the most-preferred interpretation possible, and outputs a textual representation of the translation.</Paragraph>
    <Paragraph position="3"> Theta, a small-footprint concatenative synthesizer from Cepstral LLC (Cepstral LLC 2004), then produces synthetic spoken output in the target language from the textual representation generated by Gemini. The Pashto voice was created specifically for this project.</Paragraph>
    <Paragraph position="4"> An English and a Pashto version of each component are called by a single application which includes a graphical user interface. A screen shot of the interface is shown in Figure 1.</Paragraph>
    <Paragraph position="5"> Figure 1. Screen shot of prototype system interface.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Redundancy and Handling Infelicities
</SectionTitle>
    <Paragraph position="0"> As in any complex system, performance can differ from the ideal in a number of ways, and it is important for the system to provide alternative ways to support successful communication. Some kinds of subideal performance and recovery approaches are described here.</Paragraph>
    <Paragraph position="1"> The most likely hypothesis output by the speech recognizer may not be exactly what was spoken. When the input speech is English, the English speaker can see whether the most likely hypothesis (shown in the &amp;quot;Input&amp;quot; box in Figure 1) is correct or approximately correct. If the English speaker judges the hypothesis to not be close enough to the intended text, s/he may either repeat the utterance or click on the &amp;quot;Guesses...&amp;quot; button to see an ordered list of the best hypotheses. A sample list is shown in Figure 2. If the correct text is on this list, the user may select it to submit for translation. When the input speech is in Pashto, this functionality is less useful, as the Pashto speaker is not assumed to be able to read, even if the hypotheses were displayed in Pashto  Once a recognition hypothesis has been submitted for translation, several possible problems can arise.</Paragraph>
    <Paragraph position="2"> Pashto is a moderately inflected, split-ergative Indo-European language, and for Pashto in particular, recognition errors may lead to apparent lack of syntactic agreement between elements of the sentence which should (and did in fact) agree. As Gemini tries to generate a full parse of the input, it has the option of using parse rules that relax agreement requirements.</Paragraph>
    <Paragraph position="3"> These rules are dispreferred and a parse built upon them may be kept only if a full, grammatically correct parse cannot be completed. Another possible problem is that unknown words, some grammatical constructions, and input errors may render any full parse unachievable. In this case, fallback strategies can be applied to translate partial parses, fragments, or any known words. Other strategies are currently in development.</Paragraph>
    <Paragraph position="4"> Another class of approaches for assisting communication allows the English speaker to quickly perform certain actions or play high-frequency phrases to the Pashto speaker. If there is background noise or distractions or the TTS quality is not high enough for easy comprehension, pressing the &amp;quot;Replay&amp;quot; button will immediately play back the last translation result. &amp;quot;Ask for Rep&amp;quot; plays a prerecorded sentence asking the Pashto speaker to repeat what s/he said. Several other useful phrases are available by clicking on the &amp;quot;Phrases...&amp;quot; button and then selecting from the screen shown in  be played back with a single click.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Sample Interaction
</SectionTitle>
    <Paragraph position="0"> The table below shows an excerpt of a dialog between an English and a Pashto speaker, both new to this system, whose interaction was part of an informal trial run by the MITRE Corporation in February 2004 for the DARPA CAST program. The English speaker, a doctor, had just received eighteen minutes of training in how to use the system and had had no other exposure to it. The Pashto speaker, playing the role of an injured patient, had received training in complaints consistent with a particular injury scenario and had seen others use the system, but had not interacted with the system himself.</Paragraph>
    <Paragraph position="1"> Except as noted, the translations are understandable.</Paragraph>
    <Paragraph position="2"> Spoken input System result I am a doctor, can I help you? [failure to translate because two sentences] I am a doctor z@ ddAkttar y@m Can I help you? AyA z@ d@ tA sara komak kaw@lay S@m ho, mehrabAni w@krra yes make [partial translation; full translation should have been &amp;quot;yes please do&amp;quot;] Where are you hurt? [rerecorded after one misrecognition] t@ cherta khugx ye ghwagx aw ugxa [misrecognized repeatedly; unable to give meaningful translation; should have been &amp;quot;ear and shoulder&amp;quot;] Can you breathe? AyA t@ sA akhIst@lay Se na, mUSkel lar@m no I have the problem Do you have pain? AyA t@ dard lare zAyt of much [minor misrecognition; translation should have been &amp;quot;much&amp;quot;] Do you take medications? AyA t@ dawAuna akhle [incorrect plural form of dawA, but understood by</Paragraph>
  </Section>
class="xml-element"></Paper>