<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1040">
  <Title>COMBINING KNOWLEDGE SOURCES TO REORDER N-BEST SPEECH HYPOTHESIS LISTS</Title>
  <Section position="4" start_page="0" end_page="217" type="metho">
    <SectionTitle>
2. COMBINING KNOWLEDGE
SOURCES
</SectionTitle>
    <Paragraph position="0"> Different knowledge sources (KSs) can be combined. We begin by assuming the existence of a training corpus of N-best lists produced by the recognizer, e~h list tagged with a &amp;quot;reference sentence&amp;quot; that determines which (if any) of the hypotheses in it was correct. We analyse each hypothesis H in the corpus using a set of possible KSs, each of which associates some form of information with H. Information can be of two different kinds. Some KSs may directly produce a number that can be viewed as a measure of H's plausibility.</Paragraph>
    <Paragraph position="1"> Typical examples are the score the recognizer assigned to H, and the score for whether or not H received a linguistic analysis (1 or 0, respectively). More commonly, however, the KS will produce a list of one or more &amp;quot;linguistic items&amp;quot; associated with H, for example surface N-grams in H or the grammar rules occurring in the best linguistic analysis of H, if there was one. A given linguistic item L is associated with a numerical score through a &amp;quot;discrimination function&amp;quot; (one function for each type of linguistic item), which summarizes the relative frequencies of occurrence of L in correct and incorrect hypotheses, respectively. Discrimination functions are discussed in more detail shortly. The score assigned to H  by a KS of this kind will be the sum of the discrimination scores for all the linguistic items it finds. Thus, each KS will eventuMly contribute a numerical score, possibly via a discrimination function derived from an analysis of the training corpus.</Paragraph>
    <Paragraph position="2"> The totM score for each hypothesis is a weighted sum of the scores contributed by the various KSs. The final requirement is to use the training corpus a second time to compute optimal weights for the different KSs. This is an optimization problem that can be approximately solved using the method described in \[3\] 1 .</Paragraph>
    <Paragraph position="3"> The most interesting role in the above is played by the discrimination functions. The intent is that linguistic items that tend to occur more frequently in correct hypotheses than incorrect ones will get positive scores; those which occur more frequently in incorrect hypotheses than correct ones will get negative scores. To take an example from the ATIS domain, the trigram a list of is frequently misrecognized by DECIPHER TM as a list the. Comparing the different hypotheses for various utterances, we discover that if we have two distinct hypotheses for the same utterance, one of which is correct and the other incorrect, and the hypotheses differ by one of them containing a list o\] while the other contains a list the, then the hypothesis containing a list o\] is nearly always tile correct one. This justifies giving the trigram a list o\] a positive score, and the trigram a list the a negative one.</Paragraph>
    <Paragraph position="4"> We now define formally the discrimination function dT for a given type T of linguistic item. We start by defining dT as a function on linguistic items. As stated above, it is then extended in a natural way to a function on hypotheses by defining dT(H) for a hypothesis H to be ~ dT(L), where the sum is over all the linguistic items L of type T associated with H.</Paragraph>
    <Paragraph position="5"> dT(L) for a given linguistic item L is computed as follows.</Paragraph>
    <Paragraph position="6"> (This is a sfight generalization of the method given in \[4\].) The training corpus is analyzed, and each hypothesis is tagged with its set of associated linguistic items. We then find all possible 4-tuples (U, H1, H2, L) where * U is an utterance.</Paragraph>
    <Paragraph position="7"> * H1 and H2 are hypotheses for U, exactly one of which is correct.</Paragraph>
    <Paragraph position="8"> * L is a linguistic item of type T that is associated with exactly one of H1 and H2.</Paragraph>
    <Paragraph position="9"> If L occurs in the correct hypothesis of the pair (Ha, H2), we call this a &amp;quot;good&amp;quot; occurrence of L; otherwise, it is a &amp;quot;bad&amp;quot; one. Counting occurrences over the whole set, we let g be the total number of good occurrences of L, and b be the total number of bad occurrences. The discrimination score of type T for L, dT(L), is then defined as a function d(g, b). It seems sensible to demand that d(g, b) has the following properties:</Paragraph>
    <Paragraph position="11"> We have experimented with a number of possible such functions, the best one appearing to be the following:</Paragraph>
    <Paragraph position="13"> This formula is a symmetric, logarithmic transform of the function (g + 1)/(g -t- b + 2), which is the expected a posteriori probability that a new (U, Ha,H2, L) 4-tuple will be a good occurrence, assuming that, prior to the quantities g and b being known, this probability has a uniform a priori distribution on the interval \[0,1\].</Paragraph>
    <Paragraph position="14"> One serious problem with corpus-based measures like discrimination functions is data sparseness; for this reason, it will often be advantageous to replace the raw linguistic items L with equivalence classes of such items, to smooth the data.</Paragraph>
    <Paragraph position="15"> We will discuss this further in Section 3.2.</Paragraph>
  </Section>
  <Section position="5" start_page="217" end_page="218" type="metho">
    <SectionTitle>
3. EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> Our experiments tested the general methods that we have outlined.</Paragraph>
    <Section position="1" start_page="217" end_page="218" type="sub_section">
      <SectionTitle>
3.1. Experimental Set-up
</SectionTitle>
      <Paragraph position="0"> The experiments were run on the 1001-utterance subset of the ATIS corpus used for the December 1993 evaluations, which was previously unseen data for the purposes of the experiments. The corpus, originally supplied as waveforms, was processed into N-best lists by the DECIPHER TM recognizer.</Paragraph>
      <Paragraph position="1"> The recognizer used a class bigram language model. Each N-best hypothesis received a numerical plausibility score; only the top 10 hypotheses were retained. The 1-best sentence error rate was about 34%, the 5-best error rate (i.e., the frequency with which the correct hypothesis was not in the top 5) about 19%, and the 10-best error rate about 16%.</Paragraph>
      <Paragraph position="2"> Linguistic processing was performed using a version of the Core Language Engine (CLE) customized to the ATIS domain, which was developed under the SRI-SICS-Telia Research Spoken Language Translator project \[1, 11, 12\]. The CLE normally assigns a hypothesis several different possible linguistic analyses, scoring each one with a plausibility measure. The plausibility measure is highly optimized \[3\], and for the ATIS domain has an error rate of about 5%. Only the most plausible linguistic analysis was used.</Paragraph>
      <Paragraph position="3"> The general CLE grammar was specialized to the domain using the Explanation-Based Learning (EBL) algorithm \[13\] and the resulting grammar parsed using an LR parser \[14\], giving a decrease in analysis time, compared to the normal CLE left-corner parser, of about an order of magnitude. This made it possible to impose moderately realistic resource limits: linguistic analysis was allowed a maximum of 12 CPU seconds per hypothesis, running SICStus Prolog on a Sun SPARCstation 10/412. Analysis that overran the time limit was cut off, and corresponding data replaced by null values. Approximately 1.2% of all hypotheses timed out during  linguistic analysis; the average analysis time required per hypothesis was 2.1 seconds.</Paragraph>
      <Paragraph position="4"> Experiments were carried out by first dividing the corpus into five approximately equal pools, in such a way that sentences from any given speaker were never assigned to more than one pool 3 . Each pool was then in turn held out as test data, and the other four used as training data.. The fact that utterances from the same speaker never occurred both as test and training data turned out to have an important effect on the results, and is discussed in more detail later.</Paragraph>
      <Paragraph position="5">  hypothesis by the DECIPHER TM recognizer.</Paragraph>
      <Paragraph position="6"> This is typically a large negative integer.</Paragraph>
      <Paragraph position="7"> In coverage: Whether or not the CLE assigned the hypothesis a linguistic analysis (1 or 0).</Paragraph>
      <Paragraph position="8"> Unlikely grammar construction: 1 if the most plansible linguistic analysis assigned to the hypothesis by the CLE was &amp;quot;unlikely&amp;quot;, 0 otherwise. In these experiments, the only analyses tagged as &amp;quot;unlikely&amp;quot; are ones in which the main verb is a form of be, and there is a number mismatch between subject and predicate-for example, &amp;quot;what is the fares?&amp;quot;.</Paragraph>
      <Paragraph position="9"> Class N-gram discriminants (four distinct knowledge sources): Discrimination scores for 1-, 2-, 3- and 4-grams of classes of surface linguistic items. The class N-grams are extracted after some surface words are grouped into multi-word phrases, and some common words and groups are replaced with classes; the dummy words *START* and *END* are also added to the beginning and end of the list, respectively. Thus, for example, the utterance one way flights to d f w would, after this phase of processing, be *START* flight_type_adj flights to airport_name *END*.</Paragraph>
      <Paragraph position="10"> Grammar rule discriminants: Discrimination scores for the grammar rules used in the most plausible linguistic analysis of the hypothesis, if there was one.</Paragraph>
      <Paragraph position="11"> Semantic triple diseriminants: Discrimination scores for &amp;quot;semantic triples&amp;quot; in the most plausible linguistic analysis of the hypothesis, if there was one. A semantic triple is of the form (Head1, Rel, Head2), where Head1 and Head2 are head-words of phrases, and Rel is a grammatical relation obtaining between them.</Paragraph>
      <Paragraph position="12"> Typical values for Rel are &amp;quot;subject&amp;quot; or &amp;quot;object&amp;quot;, when Head1 is a verb and Head2 the head-word of one of its arguments; alternatively, Rel can be a preposition, if the relation is a PP modification of an NP or VP. There are also some other possibilities (cf. \[3\]).</Paragraph>
      <Paragraph position="13"> 3We would llke to thank Bob Moore for suggesting this idea. The knowledge sources naturally fall into three groups. The first is the singleton consisting of the &amp;quot;recognizer score&amp;quot; KS; the second contains the four class N-gram discriminant KSs; the third consists of the remMning &amp;quot;linguistic&amp;quot; KSs. The method of \[3\] was used to calculate near-optimal weights for three combinations of KSs:  1. Recognizer score + class N-gram discriminant KSs 2. Recognizer score + linguistic KSs 3. All available KSs  To facilitate comparison, some other methods were tested as well. Two variants of the highest-in-coverage method provided a lower limit: the &amp;quot;straight&amp;quot; method, and one in which the hypotheses were first rescored using the optimized combination of recognizer score and N-gram discriminant KSs. This is marked in the tables as &amp;quot;N-gram/highest-incoverage&amp;quot;, and is roughly the strategy described in \[6\]. An upper limit was set by a method that selected the hypothesis in the list with the lower number of insertions, deletions and substitutions. This is marked as &amp;quot;lowest WE in 10-best&amp;quot;.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>