<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1005">
  <Title>Automatic Acquisition of Names Using Speak and Spell Mode in Spoken Dialogue Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Approach
</SectionTitle>
    <Paragraph position="0"> The approach adopted in this work utilizes a multi-pass strategy consisting of two recognition passes on the spoken waveform. The first-stage recognizer extracts the  spelled letters from the spoken utterance, treating the pronounced portion of the word as a generic OOV word. This is followed by an intermediate stage, where the hypotheses of the letter recognition are used to construct a pruned search space for a final sound-to-letter recognizer which directly outputs grapheme sequences. The ANGIE framework serves two important roles simultaneously: specifying the sound/letter mappings and providing language model constraints. The language model is enhanced with a morph N-gram, where the morph units are derived via corpus-based techniques. In the following sections, we first describe the ANGIE framework, followed by a detailed description of the multi-pass procedure for computing the spelling and pronunciation of the word from a waveform.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 ANGIE Sound-to-Letter Framework
</SectionTitle>
      <Paragraph position="0"> ANGIE is a hierarchical framework that encodes sub-word structure using context-free rules and a probability model. When trained, it can predict the sublexical structure of unseen words, based on observations from training data. The framework has previously been applied in bi-directional letter/sound generation (Seneff et al., 1996), OOV detection in speech recognition (Chung, 2000), and phonological modeling (Seneff and Wang, 2002).</Paragraph>
      <Paragraph position="1"> A parsing algorithm in ANGIE produces regular parse trees that comprise four distinct layers, capturing linguistic patterns pertaining to morphology, syllabification, phonemics and graphemics. An example parse for the word &amp;quot;Benjamin&amp;quot; is given in Figure 1. Encoded at the pre-terminal-to-terminal layers are letter-sound mappings. The grammar is specified through context-free rules; context dependencies are captured through a superimposed probability model. The adopted model is motivated by the need for a balance between sufficient context constraint and potential sparse data problems from a finite observation space. It is also desirable for the model to be locally computable, for practical reasons associated with the goal of attaching the learned probabilities to the arcs in a finite state network. Given these considerations, the probability formulation that has been developed for ANGIE can be written as follows:  where a36a38a37 is the a39a34a40a4a41 column in the parse tree and a36a42a37a44a43</Paragraph>
      <Paragraph position="3"> a37a4a47a48 is the label at the a53a58a40a4a41 row of the a39a34a40a4a41 column in the two-dimensional parse grid. a56 is the total number of layers in the parse tree. a39 and a53 start at the bottom left corner of the parse tree. In other words, each letter is predicted based on the entire preceding column, and the column probability is built bottom-up based on a trigram model, considering both the child and the left sibling in the grid. The probabilities are trained by tabulating counts in a corpus of parsed sentences.</Paragraph>
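The column-probability factorization above can be illustrated with a small sketch. The following Python is a toy stand-in (the data structures and function names are hypothetical, not ANGIE's actual implementation): it tabulates counts from parsed columns and then scores a column given its predecessor, bottom-up, exactly as the factorization describes.

```python
from collections import defaultdict

def train_column_model(parse_trees):
    """Tabulate counts from parsed training data.

    parse_trees: list of parses; each parse is a list of columns, and a
    column is a tuple of labels ordered bottom-up (letter first).
    """
    letter_counts = defaultdict(lambda: defaultdict(int))  # P(letter | prev column)
    label_counts = defaultdict(lambda: defaultdict(int))   # P(label | child, left sibling)
    for tree in parse_trees:
        prev = None
        for col in tree:
            if prev is not None:
                # the letter is conditioned on the entire preceding column
                letter_counts[prev][col[0]] += 1
                # each higher label is conditioned on its child and left sibling
                for j in range(1, len(col)):
                    label_counts[(col[j - 1], prev[j])][col[j]] += 1
            prev = col
    return letter_counts, label_counts

def column_prob(col, prev, letter_counts, label_counts):
    """P(col | prev), built bottom-up per the factorization in the text."""
    def mle(table, ctx, sym):
        total = sum(table[ctx].values())
        return table[ctx][sym] / total if total else 0.0
    p = mle(letter_counts, prev, col[0])
    for j in range(1, len(col)):
        p *= mle(label_counts, (col[j - 1], prev[j]), col[j])
    return p
```

On a one-parse toy corpus the model is deterministic, so every observed column transition scores 1.0.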
      <Paragraph position="4"> After training, the ANGIE models can be converted into a finite state transducer (FST) representation, via an algorithm developed in (Chung, 2000). The FST compactly represents sound-to-letter mappings, with weights on the arcs encoding mapping probabilities along with subword structure. In essence, it can be considered as a bigram model on units identified as vertical columns of the parse tree. Each unit is associated with a grapheme and a phoneme pronunciation, enriched with other contextual factors such as morpho-syllabic properties. The FST output probabilities, extracted from the ANGIE parse, represent bigram probabilities of a column sequence. While efficient and suitable for recognition search, this column bigram FST preserves the ability to generalize to OOV data from observations made at training. That is, despite having been trained on a finite corpus, it is capable of creatively licensing OOV words with non-zero probabilities.</Paragraph>
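The conversion to a column bigram FST can be sketched as follows. This is a deliberately reduced illustration (the real algorithm is described in (Chung, 2000), and ANGIE's column units carry more context than the grapheme/phoneme pairs used here): each arc consumes a column's phoneme, emits its grapheme, and carries the bigram probability of the column sequence.

```python
from collections import defaultdict

def column_bigram_fst(parsed_names):
    """Turn column sequences into bigram arcs (state, in, out, weight, next).

    parsed_names: list of column sequences; each column is a tuple whose
    first element is a grapheme and second a phoneme (a toy reduction of
    ANGIE's richer column units).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for cols in parsed_names:
        prev = "<s>"  # sentence-initial state
        for col in cols:
            counts[prev][col] += 1
            prev = col
    arcs = []
    for prev, successors in counts.items():
        total = sum(successors.values())
        for col, c in successors.items():
            grapheme, phoneme = col[0], col[1]
            # state = previous column; input = phoneme; output = grapheme
            arcs.append((prev, phoneme, grapheme, c / total, col))
    return arcs
```

Because states are column units rather than whole words, unseen words that traverse observed column transitions still receive non-zero probability, which is the generalization property noted above.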
      <Paragraph position="5"> In this work, the probability model was trained on a lexicon of proper nouns, containing both first and last names. During the initial lexical acquisition phase, over 75,000 entries were added to the lexicon via an automatic procedure. Because this yielded many errors, manual corrections have been made, and are ongoing. In a second phase, a further 25,000 names were automatically added to the lexicon using a two-step procedure: first, the grammar was trained on the original 75,000 words; then, using the trained grammar, ANGIE parsed the additional 25,000 new names, and these parses were immediately added to the full lexicon. Despite many erroneous parses, performance improved with the additional training data. After training on the total 100,000 words, the column bigram FST is highly compact, containing around 2100 states and 25,000 arcs. In total, there are 214 unique graphemes (some of which are doubletons such as &amp;quot;th&amp;quot;) and 116 unique phoneme units.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Multi-Stage Speak and Spell Recognition
</SectionTitle>
      <Paragraph position="0"> The multi-stage speak and spell approach is tailored to accommodate utterances with a spoken name followed by the spelling of the name. As depicted in Figure 2, there are three stages: the first is a letter recognizer with an unknown word model, whose letter hypotheses define a reduced search space; the second pass compiles the language models and sound-to-letter mappings into this reduced search space; a final pass uses the scores and search space defined in the previous stage to perform recognition on the waveform, simultaneously generating the spelling and phonemic sequence of the word.</Paragraph>
      <Paragraph position="1"> At the core of this approach is the manipulation of FSTs, which permits us to flexibly reconfigure the search space during recognition time. The entire linguistic search space in the recognizer can be represented by a single FST (R) which embeds all the language model probabilities at the arc transitions. Generally, R is represented by a cascade of FST compositions: R = C o P o L o G (Equation 1), where C contains diphone label mappings, P applies phonological rules, L maps the lexicon to phonemic pronunciations, and G is the language model. The above compositions can be performed prior to run-time or on the fly.</Paragraph>
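A cascade like the one above rests on weighted FST composition. The sketch below is a minimal, epsilon-free illustration in the tropical semiring (weights add along a path); the `FST` class and `compose` function are hypothetical toys, not the toolkit the authors used.

```python
class FST:
    """A toy epsilon-free weighted transducer.

    arcs: list of (src_state, input_label, output_label, weight, dst_state).
    """
    def __init__(self, start, finals, arcs):
        self.start, self.finals, self.arcs = start, set(finals), arcs

def compose(a, b):
    """Compose two epsilon-free FSTs: output labels of `a` must match
    input labels of `b`; path weights add (tropical semiring)."""
    arcs, finals = [], set()
    for (s1, i1, o1, w1, d1) in a.arcs:
        for (s2, i2, o2, w2, d2) in b.arcs:
            if o1 == i2:
                arcs.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    for f1 in a.finals:
        for f2 in b.finals:
            finals.add((f1, f2))
    return FST((a.start, b.start), finals, arcs)
```

Composing a lexicon-like machine with a language-model-like machine chains their mappings while accumulating both machines' scores, which is exactly what the cascade relies on.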
      <Paragraph position="4">  The first stage is a simple letter recognizer augmented with an OOV word model (Bazzi and Glass, 2001), which is designed to absorb the spoken name portion of the waveform. The recognition engine is segment-based, using context-dependent diphone acoustic units (Zue et al., 2000). Trained on general telephone-based data (which do not contain spelled names), the acoustic models contain 71 phonetic units and 1365 diphone classes. Using Bazzi's OOV word modeling scheme, unknown words are represented by variable-length subword units that have been automatically derived. The language model, a letter 4-gram, is trained on a 100,000 name corpus, augmented with an unknown word at the beginning of each sentence. This first stage outputs a lattice in the form of an FST, which contains, at the output labels, an unknown word label for the spoken name part of the utterance and letter hypotheses which are useful for the later stages.  A series of FST operations are performed on the output of the first stage, culminating in an FST that defines a reduced search space and integrates several knowledge sources, for the second recognition pass. Since the waveform consists of the spoken word followed by the spelling, the output FST of this stage is the concatenation of two component FSTs that are responsible for recognizing the two portions of the waveform: a first FST maps phone sequences directly to letters, and a second FST, which supports the spelling component, maps phones to the spelled letters.</Paragraph>
      <Paragraph position="5"> The first FST is the most knowledge-intensive because it integrates the first-pass hypotheses with their corresponding scores, together with additional language models and ANGIE sound-to-letter mappings. A subword trigram language model is applied to subword units that are automatically derived via a procedure that maximizes mutual information. Similar to work in (Bazzi and Glass, 2001), where subword units are derived from phones, the procedure employed here begins with letters and iteratively combines them to form larger units.</Paragraph>
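The unit-derivation idea can be sketched greedily. This is a hedged illustration, not the authors' procedure: the exact objective is not spelled out in this section, so the sketch scores each adjacent pair by frequency-weighted pointwise mutual information, starting from letters and repeatedly merging the best-scoring pair.

```python
import math
from collections import Counter

def derive_subwords(words, n_merges):
    """Greedy sketch: repeatedly merge the adjacent unit pair with the
    highest frequency-weighted pointwise mutual information."""
    seqs = [list(w) for w in words]
    for _ in range(n_merges):
        units = Counter(u for s in seqs for u in s)
        pairs = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pairs:
            break
        total_u = sum(units.values())
        total_p = sum(pairs.values())
        def score(item):
            (a, b), c = item
            p_ab = c / total_p
            return p_ab * math.log(p_ab / ((units[a] / total_u) * (units[b] / total_u)))
        (a, b), _ = max(pairs.items(), key=score)
        merged = a + b
        # rewrite every sequence, replacing each adjacent (a, b) with the merged unit
        for k, s in enumerate(seqs):
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            seqs[k] = out
    return sorted({u for s in seqs for u in s})
```

On a toy name list where every word begins with &amp;quot;be&amp;quot;, the first merge fuses that pair into a single subword unit.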
      <Paragraph position="6"> The following describes the step-by-step procedure for generating such a final FST (F), customized for each specific utterance, beginning with an input lattice (S) from the first stage. S preserves the acoustic and language model scores of the first stage.</Paragraph>
      <Paragraph position="7"> 1. Apply subword language model: the input lattice S is composed with a subword trigram (T). The trigram is applied early because stronger constraints will prune away improbable sequences, reducing the search space.</Paragraph>
      <Paragraph position="8"> The composition involves an FST mapping letter sequences to their respective subword units, together with its inverse mapping the subword units back to letters, so that the trigram scores are applied at the subword level while the lattice remains over letters.</Paragraph>
      <Paragraph position="10"> 2. Apply ANGIE model: the pruned letter lattice is composed with the ANGIE sound-to-letter FST, yielding G_A. G_A codifies language information from ANGIE, a subword trigram, and restrictions imposed by the letter recognizer. Given a letter sequence, G_A outputs phonemic hypotheses.</Paragraph>
      <Paragraph position="11"> 3. Apply phonological rules: The input and output sequences of G_A are reversed, so that the resulting FST maps phonemic hypotheses to letters. Phonological rules are then applied, expanding ANGIE phoneme units into allowable phonetic sequences for recognition, in accordance with a set of pronunciation rules, using an algorithm described in (Hetherington, 2001). The resultant FST (F1) is a pruned lattice that embeds all the necessary language information to generate letter hypotheses from phonetic sequences.</Paragraph>
      <Paragraph position="14"> 4. Create second half FST: The FST (F2) necessary for processing the spelling part of the waveform is constructed. This begins by composing S, the FST containing letter hypotheses from the first stage, with an FST representing baseforms for the letters, followed by the application of phonological rules, similar to Step 3.</Paragraph>
      <Paragraph position="15"> (Figure 2 caption fragment: the output of the first stage, containing letter hypotheses, is input to an intermediate stage, where F1 (see steps 1 to 3 in Section 3.2.2) is concatenated with F2 (step 4); the result F defines the search space for the final stage.) 5. Concatenate two parts: The final FST (F) is created by concatenating the FSTs corresponding with the first (F1) and second (F2) portions of the speak and spell waveform.</Paragraph>
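The concatenation in step 5 can be sketched as follows, using toy transducer triples (all names hypothetical): every final state of the first machine bridges to the start state of the second via an epsilon arc, so any accepted path must traverse the first half and then the second.

```python
def concatenate(fst1, fst2):
    """Concatenate two FSTs given as (start, finals, arcs) triples, where
    arcs are (src, inp, out, dst). States are tagged 1/2 to avoid
    collisions; "<eps>" marks epsilon labels. Illustrative only."""
    start1, finals1, arcs1 = fst1
    start2, finals2, arcs2 = fst2
    arcs = [((1, s), i, o, (1, d)) for (s, i, o, d) in arcs1]
    arcs += [((2, s), i, o, (2, d)) for (s, i, o, d) in arcs2]
    # bridge: each final state of machine 1 feeds the start of machine 2
    arcs += [((1, f), "<eps>", "<eps>", (2, start2)) for f in finals1]
    return (1, start1), {(2, f) for f in finals2}, arcs
```

In the speak and spell setting, the first machine would cover the spoken-name portion and the second the spelled-letter portion of the same waveform.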
      <Paragraph position="17"> As described above, the first-half FST F1 is particularly rich in knowledge constraints, because all the scores of the first stage are preserved. These are acoustic and language model scores associated with those hypotheses, determined from the spelled part of the waveform. Hence F1 contains hypotheses that are favored by the language and acoustic scores in the letter recognition pass, to be applied to the spoken part of the waveform in the next pass.</Paragraph>
      <Paragraph position="18"> The scores are enriched with an additional subword trigram and the ANGIE model to select plausible sound-to-letter mappings.</Paragraph>
      <Paragraph position="19"> The sound-to-letter recognizer conducts an entirely new search, using the enriched language models in a reduced search space, along with the original acoustic measurements from the first pass. Mapping phonetic symbols to letter symbols, the input FST (F) is equivalent to P o L o G, incorporating phonological rules and language constraints. It is then composed on-the-fly with a pre-loaded diphone-to-phone FST (C), thereby completing the search space as defined in Equation 1.</Paragraph>
      <Paragraph position="20"> The final letter hypothesis for the name is extracted from the output corresponding to the spoken name portion of the utterance, taken from the highest scoring path. Essentially, this final pass integrates acoustic information from the spelled and spoken portions of the waveform, with language model information from the grapheme-phoneme mappings and the morph N-gram.</Paragraph>
      <Paragraph position="21">  Phoneme extraction is performed using an additional pass through the search engine of the recognizer. In the ORION system, the phoneme sequence is only computed after the user has confirmed the correct spelling. The procedure is analogous to the sound-to-letter process described above, except that, instead of using output from the first-stage letter recognizer, a single letter sequence constrains the search. The sequence may either be the answer as confirmed by the user during dialogue, or the highest scoring letter sequence output from the sound-to-letter recognizer. A series of FST compositions is performed to create an FST that can compute a phonemic sequence in accordance with ANGIE model mappings, associated with the given letter sequence and the acoustic waveform. Again, the FST contains two portions, for processing each half of the speak and spell waveform. The first applies ANGIE to map phonetic symbols to phonemic symbols, restricted to paths that correspond with the input letter sequence. The second half supports the spelled letter sequence. Following FST creation, the final FST is uploaded to the search engine, which conducts a new search using the FST and the original acoustic measurements. The phoneme sequence for the name is taken as the output from the highest scoring path corresponding with the spoken part of the waveform.</Paragraph>
      <Paragraph position="22"> (Figure caption fragment: &amp;quot;... implementation for the speak and spell system.&amp;quot;)</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System Integration
</SectionTitle>
    <Paragraph position="0"> Our integration experiments were conducted in the ORION system, which is based on the GALAXY Communicator architecture (Seneff et al., 1998). In GALAXY, a central hub controls a suite of specialized servers, where interaction is specified via a &amp;quot;hub program&amp;quot; written in a scripting language.</Paragraph>
    <Paragraph position="1"> In order to carry out all of the activities required to specify, confirm, and commit new words to the system's working knowledge, several augmentations were required to the pre-existing ORION system. To facilitate the automatic new word acquisition process, two new servers have been introduced: the FST constructor and the system update server. The role of the FST constructor is to perform the series of FST compositions to build the final search FST as described previously. Via rules in the hub program, the constructor processes the output of the letter recognizer to derive an FST that becomes input to the final sound-to-letter recognizer.</Paragraph>
    <Paragraph position="2"> The second new server introduced here is the system update server, which comes into play once the user has confirmed the spelling of both their first and last names.</Paragraph>
    <Paragraph position="3"> At this point, the NL server is informed of the new word additions. It has the capability to update its trained grammar both for internal and external use. It also creates a new lexicon and class N-gram for the recognizer.</Paragraph>
    <Paragraph position="4"> In addition to the NL update, the recognizer also needs to incorporate the new words into its search space. At present, we are approaching this problem by recompiling and reloading the recognizer's search FSTs. In the future, we plan to augment the recognizer to support incremental update of the lexical and language models. The system update server is tasked with re-generating the FSTs asynchronously, which are then automatically reloaded by the recognizer. Both the recognizer and the NL system are now capable of processing the newly specified name, a capability that will not be needed until the next time the new user calls the system.</Paragraph>
    <Paragraph position="5"> One interesting aspect of the implementation for the above processing is that the system is able to make use of parallel threads so that the user does not experience delays while their name is being processed through the multiple stages. Figure 3 illustrates a block diagram of the dialogue flow. The letter recognizer processes the user's first name during the main recognition cycle of the turn. Subsequently, a parallel second thread is launched, in which the second stage recognizer searches the FST created by the FST constructor as described previously.</Paragraph>
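The parallel-thread pattern described here can be sketched in a few lines (all names hypothetical): the expensive second-stage search runs on a worker thread and deposits its result in a queue, while the main thread is free to continue the dialogue and collect the result later.

```python
import queue
import threading

def second_stage_search(letter_lattice, results):
    """Stand-in for the second-pass sound-to-letter search."""
    # ... in the real system: build the utterance-specific FST and search it ...
    results.put(letter_lattice.upper())  # placeholder hypothesis

def handle_turn(letter_hyps):
    """Launch the second pass in the background so the dialogue can continue."""
    results = queue.Queue()
    worker = threading.Thread(
        target=second_stage_search, args=(letter_hyps, results), daemon=True)
    worker.start()
    # The main thread would keep prompting the user here (phone, email, ...).
    worker.join()  # in the real system, joined later at confirmation time
    return results.get()
```

The queue decouples the two threads: the dialogue loop never blocks on recognition, and the hypothesis is simply ready by the time the system needs to confirm the spelling.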
    <Paragraph position="6"> In the meantime, the main hub program continues the dialogue with the user, asking for information such as contact phone numbers and email address. The user's last name is processed similarly. At the end of the dialogue, the system confirms the two names with the user.</Paragraph>
    <Paragraph position="7"> If they are verified, a system update is launched, while the system continues the dialogue with the user, perhaps enrolling their first task. If the user rejects a proposed spelling, the system will prompt them for a keypad entry of the name (Chung and Seneff, 2002), which will provide additional constraints.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> Experiments have been conducted to evaluate the ability to acquire spellings and pronunciations for an open set of names. We have selected a test set that combines utterances from a preliminary ORION data collection during new user enrollment and previous utterances collected from the JUPITER system (Zue et al., 2000), where at the beginning of each phone call, users are asked to speak and spell their names in a single utterance.</Paragraph>
    <Paragraph position="1"> Thus far, 80% of the test set comes from JUPITER data, in which users mostly provided first names. However, the trained models are designed to support both first and last names. As yet, no attempts have been made to separately model first and last names.</Paragraph>
    <Paragraph position="2"> Two test sets are used for evaluation. Test Set A contains words that are present in ANGIE's 100K training vocabulary (416 items, of which 387 are unique); Test Set B contains words that are previously unseen in any of the training data (219 items, of which 157 are unique).</Paragraph>
    <Paragraph position="3"> These test sets have been screened as best as possible to ensure that the spelled component corresponds to the spoken name in the utterance.</Paragraph>
  </Section>
</Paper>