File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0615_metho.xml
Size: 22,201 bytes
Last Modified: 2025-10-06 14:14:42
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0615"> <Title>Filtering Errors and Repairing Linguistic Anomalies for Spoken Dialogue Systems</Title> <Section position="3" start_page="0" end_page="74" type="metho"> <SectionTitle> 2 System architecture </SectionTitle> <Paragraph position="0"> The system architecture consists in a speech recognizer, a word confidence scoring module, a robust parsing module and higher modules -around a dialogue module (Normand, Pernel, and Bacconnet, 1997).</Paragraph> <Paragraph position="1"> The modules of the system articulate in a complementary way. The scoring module goal is to provide word acoustic confidence scores to help the robust parser in its task. The parsing module takes the best recognition hypothesis. It attempts to repair recognition errors and transmits a semantic representation of the sentence to the dialogue module. It relies on a lexicalized tree grammar and on integrated repairing rules. They make use of the knowledge embedded in the lexical grammar and of candidates present in the N-best hypothesis. We have studied its capacities to detect and predict missing elements and to select syntactically and semantically well-formed sentences. The robust parser needs con- null fidence scoring module to point out inserted and substituted elements.</Paragraph> <Paragraph position="2"> The words identified as inserted or as substituted are marked but the decision is laid upon the robust parsing or subsequent linguistic processes. Moreover, falsely rejected words can give rise to deletion repairing procedures. The robust parsing strategy applies syntactic and semantic well-formedness constraints. It derives the meaning of the sentence out of available elements and furthermore predicts the missing elements required to meet the constraints.</Paragraph> <Paragraph position="3"> Whatever the case, initially well-formed sentence or not, the parsing produces a usable analysis for the higher layers to perform the final interpretation or to trigger a repairing dialogue.</Paragraph> </Section> <Section position="4" start_page="74" end_page="76" type="metho"> <SectionTitle> 3 Word Errors Filtering </SectionTitle> <Paragraph position="0"> Inserted and substituted elements are a major problem as they are a source of misunderstanding. If not treated early on in a spoken dialogue system, they weaken the dialogue interaction, caught between running the risk of confusing the user with irrelevant interactions or annoying the user with repetitive confirmation checks.</Paragraph> <Paragraph position="1"> As parsing is not always able to reject ill-recognized sentences, especially when they remain well-formed, cross-checking is required between acoustic and linguistic information. Our method is to isolate errors according to a scoring criterion and then transmit to the parsing suspected elements with the alternative acoustic candidates. They can be reactivated by the parsing if necessary, to achieve a complete analysis.</Paragraph> <Section position="1" start_page="74" end_page="75" type="sub_section"> <SectionTitle> 3.1 Scoring Method </SectionTitle> <Paragraph position="0"> A way to get a scoring criterion is to attribute a recognition confidence score to each word in the best sentence hypothesis.</Paragraph> <Paragraph position="1"> A confidence score relates to the word being rightly recognized and not only to the word being acoustically close to an acoustic reference. 
It normally depends on the recognizer's behaviour, the language to be recognized, and the application. For example, Rivlin (1995) sees it as a normalisation of the phoneme acoustic scores and derives an exact estimation from a recognition corpus. We propose here a simple on-line computation of the word confidence score. It is not an exact measure, but it has minimal knowledge requirements. The scoring relies on the observation of the concurrent hypotheses of the recognizer and their associated acoustic scores.</Paragraph> <Paragraph position="2"> We have tested it with N-best sentence hypotheses, but lattices and word graphs could be investigated further.</Paragraph> <Paragraph position="3"> An initial score for each word in the best sentence is taken either from the word acoustic score or from the sentence score, distributed uniformly over the words. The score we have used here is the global sentence acoustic score. This initial word score is re-evaluated on the basis of concordances between the different recognition hypotheses. The major parameter for score estimation is the alignment between each word in the best hypothesis and the words in the other hypotheses. In our case this alignment is achieved by a dynamic programming method. For each N-best hypothesis, an alignment value is defined from the word alignments. It especially disfavours repeated occurrences of a given word candidate. Let w_i be the i-th word in the best hypothesis; the alignment value at rank n is:
$$Al_n(w_i) = \begin{cases} 1 & \text{if } w_i \text{ is aligned with itself} \\ -1 & \text{if } w_i \text{ is not aligned} \\ -r & \text{if } w_i \text{ is aligned for the } r\text{-th time with a given word} \end{cases} \qquad (1)$$</Paragraph> <Paragraph position="5"> The re-evaluation of a word score derives from this alignment value.</Paragraph> <Paragraph position="6"> Each N-best hypothesis gives rise to a re-evaluation of the current word score. This re-evaluation decomposes into two factors, a re-scoring potential V and a re-scoring amplitude ΔS. Let S_n(w_i) be the score of the word w_i after observing the N-best hypotheses up to rank n:
$$S_n(w_i) = S_{n-1}(w_i) + V_n(w_i)\,\Delta S_n \qquad (2)$$
where V_n(w_i) is the potential for rescoring the word w_i according to hypothesis H_n (the sentence hypothesis at rank n) and ΔS_n is the rescoring amplitude at rank n.</Paragraph> <Paragraph position="7"> The first factor of the re-evaluation is the potential, defined in equation 3. It is based on the alignments and indicates the type of increase or decrease that a word deserves. A context effect is introduced into the potential in the form of penalties and bonuses proportional to the alignment values of the direct neighbours (see equation 4), so that:
$$V_n(w_i) = Al_n(w_i) + \delta Al_n(w_{i-1}, w_i) + \delta Al_n(w_{i+1}, w_i) \qquad (3)$$
$$\delta Al_n(w_j, w_i) \propto Al_n(w_j) \qquad (4)$$</Paragraph> <Paragraph position="9"> where Al_n(w_i) is the alignment value of word w_i between the first-best hypothesis H_1 and the N-best hypothesis H_n, and δAl_n(w_j, w_i) is the context effect of word w_j on word w_i (equation 4). In practice this is either a positive contribution if w_j is well aligned or a negative contribution if w_j is badly aligned.</Paragraph> <Paragraph position="10"> We consider the context effect only from the immediate neighbours.</Paragraph> <Paragraph position="11"> The second factor of the re-evaluation is the amplitude (cf. equation 2). The amplitude is the same for every word at a given rank. It is based on the n-th hypothesis score and on the rank, so that the amplitude decreases with the rank and with the relative score difference between H_1 and H_n. It expresses the rescoring power of hypothesis H_n and is calculated iteratively (equation 5).</Paragraph> <Paragraph position="13"> In equation 5, a linear slope term ensures a minimal decrease, and S(H_n) is the global acoustic score of hypothesis H_n.</Paragraph> <Paragraph position="14"> The scoring stops once the amplitude reaches zero. Figures 1 and 2 show the evolution of the word scores across the N-best re-evaluation.</Paragraph>
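<Paragraph> As a rough illustration, the re-scoring scheme above could be implemented along the following lines. This Python sketch is not part of the original system: the use of difflib as a stand-in for the dynamic programming alignment, the context weight ALPHA, and the amplitude schedule are assumptions introduced here (the exact amplitude recurrence of equation 5 is not reproduced above).
from difflib import SequenceMatcher

ALPHA = 0.5   # assumed context weight; the paper only states "proportional"

def align(best, hyp):
    """Map each position of the best hypothesis to the word it aligns with
    in hyp (None if unaligned). difflib stands in for the DP alignment."""
    mapping = [None] * len(best)
    matcher = SequenceMatcher(a=best, b=hyp)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("equal", "replace"):
            for k in range(min(i2 - i1, j2 - j1)):
                mapping[i1 + k] = hyp[j1 + k]
    return mapping

def alignment_values(best, hyp, seen):
    """Equation (1): +1 if aligned with itself, -1 if unaligned,
    -r when aligned for the r-th time with the same competing word."""
    values = []
    for i, w in enumerate(align(best, hyp)):
        if w is None:
            values.append(-1.0)
        elif w == best[i]:
            values.append(1.0)
        else:
            seen[i][w] = seen[i].get(w, 0) + 1
            values.append(-float(seen[i][w]))
    return values

def rescore(nbest, scores):
    """nbest: tokenized hypotheses in rank order; scores: positive global
    acoustic scores. Returns one confidence score per word of the best
    hypothesis (equations 2-4, with an assumed amplitude schedule)."""
    best = nbest[0]
    conf = [scores[0] / len(best)] * len(best)   # sentence score spread uniformly
    seen = [dict() for _ in best]
    amplitude = 1.0
    for n in range(1, len(nbest)):
        # decreases with the rank and with the score gap to the best hypothesis
        amplitude = max(0.0, amplitude * scores[n] / scores[0] - 0.05)
        if amplitude == 0.0:
            break
        al = alignment_values(best, nbest[n], seen)
        for i in range(len(best)):
            context = sum(al[j] for j in (i - 1, i + 1) if j in range(len(best)))
            conf[i] += (al[i] + ALPHA * context) * amplitude
    return conf
</Paragraph>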
</Section> <Section position="2" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.2 Filtering application </SectionTitle> <Paragraph position="0"> Once the word confidence scores are available, the filtering still needs a threshold to point out would-be errors. It is set on-line as the maximum score that different typical cases of words to be eliminated could reach, and it is computed at the same time as the word confidence scores. We consider the worst-case score of several empirical cases, independent of the two recognizers we tested. One of those cases is a word that is not aligned 80% of the time and is always surrounded by aligned neighbours.</Paragraph> <Paragraph position="1"> Once the suspect words have been spotted, it remains to be decided whether they are substitutions or insertions. We distinguish them thanks to segmental cues and to local word variations between competing hypotheses. In practice, the alignments previously calculated are scanned; if the two bordering neighbours of a word w appear adjacent and well aligned in at least one hypothesis, w is marked as an insertion.</Paragraph>
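<Paragraph> As an illustration only, the filtering decision just described might be sketched as follows. This is not the paper's implementation: the per-hypothesis position mappings, the externally supplied threshold and the function name are assumptions made for this example.
def classify_suspects(best, nbest_alignments, confidences, threshold):
    """Flag the words of the best hypothesis whose confidence falls below the
    threshold and decide, from the stored alignments, whether each suspect
    looks like an insertion or a substitution.
    nbest_alignments: one dict per competing hypothesis, mapping positions of
    well-aligned words in the best hypothesis to their positions there."""
    decisions = {}
    for i, word in enumerate(best):
        if confidences[i] >= threshold:
            continue
        # insertion test: in at least one competing hypothesis, the two
        # bordering neighbours are well aligned and adjacent, i.e. nothing
        # stands between them where the suspect word would be
        insertion = False
        for mapping in nbest_alignments:
            left, right = mapping.get(i - 1), mapping.get(i + 1)
            if left is not None and right == left + 1:
                insertion = True
                break
        decisions[(i, word)] = "insertion" if insertion else "substitution"
    return decisions
</Paragraph>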
</Section> <Section position="3" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 3.3 Evaluation </SectionTitle> <Paragraph position="0"> We have tested the word scoring module, with the incorporated filtering, on errors produced by two existing ASR systems from SRI and Cambridge University. The former, the Nuance Communications recognizer, is constrained by a context-free grammar; it exploits phonemes in context and its acoustic models are sampled at 8 kHz. The latter, Abbot, uses an n-gram language model (a backed-off trigram model); it models standard phonemes with a neural network and its acoustic models are sampled at 16 kHz. The training corpus for the trigram was generated artificially by the context-free grammar of the first recognizer; 15% of the test set falls outside the Nuance context-free grammar.</Paragraph> <Paragraph position="1"> The application domain is taken from the COVEN project (Normand and Tromp, 1996), described at http://chinon.thomson-csf.fr/coven/. COVEN (COllaborative Virtual ENvironments) addresses the technical and design-level requirements of virtual-environment-based multi-participant collaborative activities in professional and citizen-oriented domains. Among the grounding testbed applications, an interior design application is being developed, which provides the background of the work described in this article. A typical interior design scenario deals with the composition of pieces of furniture, equipment and decoration in an office room by several participants, within the limits of a common budget. Elements of the design are taken from a set of possible furniture, equipment and decoration objects, with variable attributes in value domains. The user may ask the system for information, which provides guidance for the user's decisions.</Paragraph> <Paragraph position="2"> The evaluation results of the speech recognizers are given with other results in table 5. Here are two examples of scoring and filtering. Figure 1 shows the evolution, across seven N-best hypotheses, of the score profile of an ill-recognized sentence. At the end, the two ill-recognized words (some and armchairs) are identified as errors; they are classified as substitutions according to their type of alignment in the different N-best hypotheses. The recognition hypotheses are displayed in table 1 (the recognizer is Nuance).
Table 1 (Nuance):
uttered: DO YOU HAVE SOME RED ARMCHAIRS
H1: DO YOU HAVE TWO RED COMPUTERS
H2: DO YOU HAVE TWO RED ARMCHAIRS
H3: DO YOU HAVE THOSE RED COMPUTERS
H4: DO YOU HAVE THE RED COMPUTERS
H5: DO YOU HAVE THOSE RED ARMCHAIRS
H6: DO YOU HAVE THE RED ARMCHAIRS
H7: DO YOU HAVE SOME RED COMPUTERS
best hypothesis: "do you have two red computers"</Paragraph> <Paragraph position="3"> In the second example, table 2 (from Abbot), the word is is inserted, but not in all N-best hypotheses. The confidence scores succeed in pointing out is as ill-recognized; the alignment considerations then classify it as an insertion.
Table 2 (Abbot):
uttered: CAN YOU GIVE ME THE BUDGET
H1: CAN YOU GIVE ME IS A BUDGET
H2: CAN YOU GIVE ME IS THE BUDGET
H3: CAN YOU GIVE ME A BUDGET
H4: CAN YOU GIVE ME IT BUDGET
H5: CAN YOU GIVE IT THE BUDGET
H6: CAN YOU GIVE ME THE BUDGET
H7: CAN YOU GIVE ME THESE BUDGET
best hypothesis: "can you give me is a budget"</Paragraph> <Paragraph position="4"> A first evaluation of the filtering suggests that it may be a good guide but not a sufficient criterion: some parameter settings, such as the threshold, remain problematic. Table 5 displays rather limited performance for the filtering taken alone, and we suspect that even with future improvements it will remain limited. Better filtering can only be achieved if it is informed by other knowledge sources. The performance of the filtering, when coupled with the robust parsing, is indeed much more satisfactory.</Paragraph> </Section> </Section> <Section position="5" start_page="76" end_page="79" type="metho"> <SectionTitle> 4 Repairing Parsing Strategy </SectionTitle> <Paragraph position="0"> The aim of the robust parser presented here is to build the semantic representation needed by the higher layers of the system when faced with possibly ill-formed sentences. The parsing itself is driven by a Lexicalized Tree Grammar (Schabes, Abeillé, and Joshi, 1988). It relies on a set of elementary trees, defined in the lexicon, each of which has at least one terminal symbol on its frontier, called the anchor. Trees can be combined through two simple operations: substitution and furcation (de Smedt and Kempen, 1990). (It should be borne in mind that the term substitution, when speaking of tree grammars, has nothing to do with the term substitution that refers to a recognition error.) These operations are theoretically equivalent to Tree Adjoining Grammar operations. However, an original property of our Lexicalized Tree Grammar is that it integrates a set of semantic operations which lay down additional constraints. The parser handles semantic features, attached to the trees, and propagates them according to specific rules (Roussel, 1996). The result is a semantic representation built synchronously with the syntactic tree.</Paragraph> <Paragraph position="1"> Figure 3: features for the sentence "give me more information about the company".
In figures 3 and 4, the heads of the trees are standard syntactic categories. The star symbol on the right or left of the head indicates an auxiliary tree that will combine with a compatible tree: an X* head symbol indicates a tree which combines with a node of category X on its right, and a *X head symbol indicates a tree which combines with a node of category X on its left. Nodes X0, X1, or more generally Xn, are substitution sites: they await a tree whose head symbol is X. Substitution sites bear syntactic and semantic constraints on their possible substitutors. Here, the semantic constraints are made visible in the node symbol (e.g. N0-PERSON means that the substitutor of this node must be of category N -noun- and must possess the semantic feature :PERSON).</Paragraph>
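<Paragraph> To make the notation above concrete, a toy representation of such constrained substitution sites might look as follows; the class names, feature values and lexical entries are invented for the illustration and are not taken from the COVEN grammar.
from dataclasses import dataclass, field

@dataclass
class SubstSite:
    """A substitution site such as N0-PERSON: it accepts a tree whose head
    category matches and whose semantics carries every required feature."""
    category: str
    required_features: frozenset = frozenset()

@dataclass
class ElementaryTree:
    head: str                                   # e.g. "S", "N", or "N*" for an auxiliary tree
    anchor: str                                 # the lexical anchor, e.g. "give"
    sites: list = field(default_factory=list)   # open substitution sites
    features: frozenset = frozenset()           # semantic features of the tree

def can_substitute(site, tree):
    """Syntactic and semantic well-formedness check for a substitution."""
    return tree.head == site.category and site.required_features <= tree.features

# A toy fragment in the spirit of figure 3 (entries and features are invented):
give = ElementaryTree(head="S", anchor="give",
                      sites=[SubstSite("N", frozenset({"PERSON"})),    # N0-PERSON
                             SubstSite("N", frozenset({"OBJECT"}))])   # N1-OBJECT
me = ElementaryTree(head="N", anchor="me", features=frozenset({"PERSON"}))

assert can_substitute(give.sites[0], me)
</Paragraph>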
<Paragraph position="2"> The parsing reveals, through linguistic anomalies, errors that would not be spotted efficiently by acoustic criteria. The linguistic context makes it possible to enrich and complete the analysis in case of an error, whether the error is detected during the parsing as a linguistic anomaly or signalled beforehand by the confidence scores.</Paragraph> <Paragraph position="3"> In practice, the robust parsing strategy is organized around a single parser, which is used iteratively according to the anomalies encountered. Three passes can each provide an analysis when anomalies are detected; for correct sentences, the first pass is sufficient. Each pass in turn modifies the result of the previous pass and hands it back to the parser.</Paragraph> <Paragraph position="4"> In the first pass, lexical items are matched with their corresponding elementary trees in the lexicon. Concurrent trees for one item give rise to parallel concurrent branches of parsing, but they are taken into account in a local chart parsing.</Paragraph> <Paragraph position="5"> For example, the verb want is associated in the COVEN lexicon (cf. section 3.2) with two entries, one for the infinitive construction and one for the transitive construction. As the preposition to also exists in the lexicon, a sentence in which the words want and to appear triggers two lexicon matchings, and thus two parsing branches.</Paragraph> <Paragraph position="6"> Figure 4 (semantic features for the words want to) displays the trees involved. The parser will select the right matching in the course of the syntactico-semantic operations, thanks to the expectations of the substitution sites.</Paragraph> <Paragraph position="7"> The first pass includes a first feature of robustness, since unreliable words signalled by the filtering as probable substitutions are represented by an automatically generated "joker" tree. A joker tree is an overspecified tree that cumulates the semantic features of the different candidates whose elementary trees share the same structure. (Joker trees are similar to elementary trees; they can also be defined manually to fit identified cases.) Several alternative joker trees are generated when the word candidates belong to different categories. Initially, all semantic features in an overspecified joker tree are marked as uncertain, so as not to confuse the higher levels; then, during the parsing, the semantic features mobilised for the tree operations are relieved of their uncertain status. To avoid a heavy combinatorial search, operations that would directly combine two adjacent jokers are not attempted.</Paragraph> <Paragraph position="8"> Concerning insertions, the parser checks whether a local analysis is possible without the word suspected to be inserted; if so, the decision is made to eliminate the word; if not, the word is considered a substitution and processed as described above. This is not an absolute criterion; in particular, optional words falsely considered to be insertions by the filtering are not recovered.</Paragraph>
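<Paragraph> A hypothetical sketch of the joker-tree generation used in this first pass is given below; the flat lexicon format, the structure identifiers and the feature names are invented here and stand in for the real elementary trees.
from collections import defaultdict

def build_joker_trees(candidates, lexicon):
    """Group the acoustic candidates for one unreliable word by the structure
    of their elementary trees and build one overspecified joker per group.
    lexicon maps a word to (structure_id, category, semantic_features)."""
    groups = defaultdict(list)
    for word in candidates:
        if word in lexicon:
            groups[lexicon[word][:2]].append(word)   # key: (structure, category)
    jokers = []
    for (structure, category), words in groups.items():
        features = set()
        for w in words:
            features.update(lexicon[w][2])
        jokers.append({
            "structure": structure,
            "category": category,
            "candidates": words,
            # cumulated features start out marked as uncertain; the parser
            # confirms the ones it actually uses during the tree operations
            "features": {f: "uncertain" for f in sorted(features)},
        })
    return jokers

# Hypothetical lexicon entries for the determiner candidates of table 2:
lexicon = {
    "a":     ("DET_TREE",  "DET", {"INDEFINITE"}),
    "the":   ("DET_TREE",  "DET", {"DEFINITE"}),
    "these": ("DET_TREE",  "DET", {"DEFINITE", "PLURAL"}),
    "it":    ("PRON_TREE", "N",   {"OBJECT"}),
}
print(build_joker_trees(["a", "the", "these", "it"], lexicon))
</Paragraph>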
<Paragraph position="9"> The repairing capacities at this stage apply, for instance, to the case shown in table 2. In the sentence "can you give me is a budget", the word a is marked as a substitution (cf. 3.2). This triggers the generation of joker trees: the candidates a, the, this and these are represented by a single joker tree, while it, in the 4th-best hypothesis, involves a different joker tree (in fact its own tree, but with semantic features marked as uncertain). The branch of parsing containing this latter joker is eliminated on syntactic grounds, whereas the first branch of parsing turns into a complete analysis (figure 5). The word is, which was marked as a possible insertion, is confirmed in that status and definitively eliminated.</Paragraph> <Paragraph position="10"> The second pass aims at recovering from would-be deleted words by re-inserting expected co-occurring words. We use knowledge about co-occurrences implicitly described in some elementary trees: elementary trees defined for more than one anchor are now selected even if not all their anchors are present in the recognized sentence. It is checked, however, whether the missing anchors appear in competing recognition hypotheses at compatible positions (the position is figured out from the hypothesis alignment; see section 3.1). In the following example, table 3, the recognizer (here, Abbot) has recognized the sentence whom is this chair are too light instead of the actual utterance whom is this chair chosen by.</Paragraph> <Paragraph position="11"> Table 3 (Abbot): uttered "WHOM IS THIS CHAIR CHOSEN BY"; the original sentence is recovered.
The sequence are too light is spotted by the filtering as a probable substitution. At pass one, the parser does not succeed in putting together elementary trees that span the whole sentence.</Paragraph> <Paragraph position="12"> In pass two, it is observed that in the sure part of the sentence, whom is this chair, the two words whom and be begin several multi-anchor elementary trees. The candidates aligned with the sequence are too light allow only one multi-anchor tree, WHOM-BE-N1-CHOSEN-BY, to be selected. This provides a complete analysis.</Paragraph> <Paragraph position="13"> The second pass thus enables a lexical recovery. The knowledge it exploits, about dependencies between words at arbitrary distance, can operate particularly efficiently with an n-gram-driven recognizer. Indeed, the co-occurrences captured by an n-gram model suffer from a limited scope and an adjacency condition.</Paragraph> <Paragraph position="14"> The third pass differs from the previous passes; instead of initiating the recovery from the lexical elements at hand, it draws predictions from the grammatical expectations.</Paragraph> <Paragraph position="15"> This pass is meant to detect the remaining errors and to complete the analysis with underspecified elements.</Paragraph> <Paragraph position="16"> For each anomaly revealed by the parsing, the trees around it are examined to determine whether it is possible to restore local well-formedness by inserting a tree.</Paragraph> <Paragraph position="17"> The patterns of anomaly that fit this case are defined in a compact way thanks to the general tree types used in the grammar. There are about twenty patterns, and each of them is made to insert the required tree, in the form of an underspecified joker tree. This type of joker tree has a full syntactic structure but undefined semantic features: some semantic features can be added in the course of the syntactico-semantic operations. The third pass can choose to ignore joker trees introduced in the first pass. This makes it possible to correct irrelevant joker matchings from the first pass, which occur when two words are substituted for a single word, or when an insertion is classified as a substitution.</Paragraph>
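<Paragraph> Purely as an illustration of this pattern-driven repair, a minimal sketch follows; the two patterns and the category labels are invented and stand in for the grammar's actual set of about twenty patterns.
# Hypothetical anomaly patterns: (head of the left tree, head of the right
# tree) -> category of the underspecified joker tree to insert between them.
ANOMALY_PATTERNS = {
    ("N", "N"): "PREP",   # two nominal trees side by side: try a preposition
    ("V", "V"): "CONJ",   # two verbal trees side by side: try a conjunction
}

def third_pass_repairs(tree_heads):
    """Scan the heads of the partial analyses left by the earlier passes and
    propose, for each anomaly, an underspecified joker tree (full syntactic
    structure, empty semantics) that could restore local well-formedness."""
    repairs = []
    for i in range(len(tree_heads) - 1):
        pattern = (tree_heads[i], tree_heads[i + 1])
        if pattern in ANOMALY_PATTERNS:
            repairs.append({
                "insert_at": i + 1,
                "category": ANOMALY_PATTERNS[pattern],
                "semantic_features": {},   # undefined until an aligned N-best
                                           # word instantiates the joker
            })
    return repairs

# A verbal tree followed by two adjacent nominal trees, roughly what remains
# when a preposition such as "about" is missing between two noun phrases:
print(third_pass_repairs(["V", "N", "N"]))
</Paragraph>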
<Paragraph> The example in table 4 shows a typical omission recovery.
Table 4: uttered "CAN YOU GIVE ME MORE INFORMATION ABOUT THE COMPANY".
The word about was deleted, so that neither of the first two passes can span the entire sentence. The third pass succeeds in inferring an analysis by inserting a generic prepositional tree that meets the syntactic and semantic expectations (see figure 7).</Paragraph> <Paragraph position="18"> Yet the recovery leaves the information introduced by the preposition undefined. However, a look at compatible aligned words in the N-best hypotheses can instantiate the joker once an analysis is found.</Paragraph>
Table 5 (for different parsing options): wrong filtering, sentence rightly rejected by the parsing; wrong filtering, sentence falsely analysed as well formed; wrong filtering, sentence analysed through the robust parsing.
</Section> </Paper>