<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2008"> <Title>THE VOYAGER SPEECH UNDERSTANDING SYSTEM: A PROGRESS REPORT*</Title> <Section position="3" start_page="0" end_page="51" type="metho"> <SectionTitle> TASK DESCRIPTION </SectionTitle> <Paragraph position="0"> In order to explore issues related to a fully-interactive spoken language system, we have selected a task in which the system knows about the physical environment of a specific geographical area as well as certain objects inside this area, and can provide assistance on how to get from one location to another within this area. The system, which we call VOYAGER, currently focuses on the city of Cambridge, Massachusetts, between MIT and Harvard University, as shown in Figure 1. It can answer a number of different types of questions about certain hotels, restaurants, hospitals, and other objects within this region. At the moment, VOYAGER has a vocabulary of 324 words. Within this limited domain of knowledge, it is our hope that VOYAGER will eventually be able to handle any reasonable query that a native speaker is likely to initiate. As time progresses, VOYAGER's knowledge base will undoubtedly grow. (*This research was supported by DARPA under Contract N00014-89-J-1332, monitored through the Office of Naval Research.)</Paragraph> </Section> <Section position="4" start_page="51" end_page="51" type="metho"> <SectionTitle> SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> VOYAGER is made up of three components. The speech recognition component converts the speech signal into a set of word hypotheses. The natural language component then provides a linguistic interpretation of this set of words. The parse generated by the natural language component is subsequently transformed into a set of query functions, which are passed to the back-end for response generation. The back-end is an enhanced version of a direction assistance program developed by Jim Davis of MIT's Media Laboratory [3].</Paragraph> <Paragraph position="1"> We will describe each component in sequence, paying particular attention to those parts that have not been previously reported.</Paragraph> </Section> <Section position="5" start_page="51" end_page="52" type="metho"> <SectionTitle> SPEECH RECOGNITION COMPONENT </SectionTitle> <Paragraph position="0"> The first component of VOYAGER uses the SUMMIT speech recognition system developed in our group. SUMMIT places heavy emphasis on the extraction of phonetic information from the speech signal. It achieves speech recognition by explicitly detecting acoustic landmarks in the signal in order to facilitate acoustic-phonetic feature extraction. The system can be trained automatically, since it does not rely on extensive knowledge engineering. The design philosophy, implementation, and evaluation of the SUMMIT system have been described in detail previously [4]. As a result, we will report here only on modifications made to the system since the last workshop. These include the development of a new module for lexical expansion via phonological rules, and a new corrective training procedure.</Paragraph> <Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> Lexical Expansion </SectionTitle> <Paragraph position="0"> The original SUMMIT system used a phonological expansion capability provided to us by SRI [6]. Within the last year, however, we have decided to rewrite this part of the system in order to gain flexibility and speed. The new version, named MARBLE, offers several new properties. A canonical set of phonemes is represented by a set of default values for distinctive features. Allophonic information due to context dependencies can be specified in the particular instance of a phoneme generated in a word lattice. Thus, for instance, when a word-final /s/ and a word-initial /s/ merge, the resulting /s/ can be marked as [+geminate]. This information can then be incorporated into the scoring for that particular allophone. The allophonic slot can also be used to indicate the place of articulation of adjacent consonants, for example, to facilitate the decoding of context-dependent models. The rule-writing process is also straightforward, and it is simple to keep track of rule ordering. Finally, the time it takes to expand a lexicon has been reduced. We believe this new rule system will be a powerful tool for effectively representing context dependencies.</Paragraph>
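<Paragraph> To make the rule mechanism above concrete, the following is a minimal sketch, in Python, of a gemination rule applied during lexical expansion. The data structures, feature names, and rule format are our own illustrative assumptions, not MARBLE's actual representation:

# Illustrative sketch of a MARBLE-style context-dependent rule (assumed
# representation): phones carry default features that a rule may override.

def expand_gemination(words):
    """Merge a word-final phone with an identical word-initial phone,
    marking the merged phone [+geminate] so scoring can treat it as a
    distinct allophone."""
    phones = []
    for word in words:
        for phone in word:
            previous = phones[-1] if phones else None
            if (previous is not None
                    and previous["label"] == phone["label"]
                    and previous.get("word_final")
                    and phone.get("word_initial")):
                previous["geminate"] = True      # the [+geminate] allophone
                previous["word_final"] = False   # merged across the boundary
            else:
                phones.append(dict(phone))
    return phones

# "gas station": a word-final /s/ meets a word-initial /s/.
gas = [{"label": "g"}, {"label": "ae"}, {"label": "s", "word_final": True}]
station = [{"label": "s", "word_initial": True}, {"label": "t"},
           {"label": "ey"}, {"label": "sh"}, {"label": "ix"}, {"label": "n"}]
print(expand_gemination([gas, station]))
</Paragraph>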
<Paragraph position="1"> Corrective Training The training of SUMMIT is performed iteratively, after being initialized on the TIMIT database [4,5]. For each iteration, the recognizer computes the best alignment between a path in the acoustic-phonetic network and a path in the lexical network, i.e., the recognized output. The recognizer also computes a forced alignment using only the correct string of words. The system then trains the next iteration of phonetic models based on the matches between lexical arcs and phonetic segments in the forced alignments. The recognizer also adjusts lexical arc weights based on a comparison of the number of times an arc was used incorrectly (present in the best alignment but not in the forced alignment) to the number of times the arc was missed (present in the forced alignment but not in the best alignment). The goal of this corrective training procedure is to equalize the number of times an arc is missed and the number of times it is used incorrectly.</Paragraph> <Paragraph position="2"> If sufficient training data is not available for a particular arc, then its weights are derived by collapsing it with other phonetically similar arcs.</Paragraph>
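<Paragraph> The arc-weight adjustment just described can be sketched as follows. The additive update rule and step size are assumptions made for illustration; the text above specifies only the quantity being equalized, not the exact formula:

# Sketch of the corrective arc-weight adjustment (assumed update rule).
# Arc weights are treated as costs: an arc that shows up in the best
# alignment more often than in the forced alignment is made costlier.

def corrective_update(weights, false_uses, misses, step=0.1):
    """Nudge each arc weight by the difference between how often the arc
    was used incorrectly and how often it was missed."""
    updated = dict(weights)
    for arc in weights:
        diff = false_uses.get(arc, 0) - misses.get(arc, 0)
        updated[arc] += step * diff   # over-used arcs become more expensive
    return updated

weights = {"s/[+geminate]": 0.0, "t/flap": 0.0}
false_uses = {"s/[+geminate]": 12, "t/flap": 3}
misses = {"s/[+geminate]": 4, "t/flap": 9}
print(corrective_update(weights, false_uses, misses))
</Paragraph>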
<Paragraph position="3"> Presently, lexical decoding is accomplished by using the Viterbi algorithm to find the best path that matches the acoustic-phonetic network with the lexical network. Since the speech recognition and natural language components are not yet fully integrated, we currently use a word-pair language model with a perplexity of 22 to constrain the search space.</Paragraph> </Section> </Section> <Section position="6" start_page="52" end_page="54" type="metho"> <SectionTitle> NATURAL LANGUAGE COMPONENT </SectionTitle> <Paragraph position="0"> In the context of a spoken language system, the natural language component should perform two critical functions: 1) to provide constraint for the recognizer component, and 2) to provide an interpretation of the meaning of the sentence to the back end. Our natural language system, TINA, was specifically designed to meet these two needs. The basic design of TINA has been described elsewhere [7], and therefore will only be briefly mentioned here. Instead, we would like to focus on the issue of how to incorporate semantics into the parses. We have found that enriching the parse tree with semantically loaded categories at the lower levels leads both to improved word predictions and to a relatively straightforward interface with the back end.</Paragraph> <Section position="1" start_page="52" end_page="54" type="sub_section"> <SectionTitle> General Description </SectionTitle> <Paragraph position="0"> The grammar is entered as a set of simple context-free rules, which are automatically converted to a shared network structure. The nodes in the network are augmented with constraint filters (both syntactic and semantic) that operate only on locally available parameters. Typically, several independent hypotheses are simultaneously active, and parameter modifications include protection of shared information, such that a parallel implementation would be possible. Efficient memory use is achieved through a recycling mechanism for the node structures, such that they become available in a resource pool whenever their current assignment is completed. These issues will become important as we move towards a fully integrated system.</Paragraph> <Paragraph position="1"> One of the key features of TINA is that all arcs in the network are associated with probabilities, acquired automatically from a set of example sentences. It is important to note that the probabilities are established not on the rule productions but rather on arcs connecting sibling pairs in a shared structure for a number of linked rules. For instance, all occurrences of SUBJECT are pooled together for probability assignments on their children, regardless of the structural positions of these occurrences within a clause. The effect of such pooling is essentially a hierarchical bigram model. We believe this mechanism offers the capability of generating probabilities in a reasonable way by sharing counts on syntactically/semantically identical units in differing structural environments.</Paragraph> <Paragraph position="2"> Semantic Filtering For VOYAGER, we were interested in designing a parser that could handle all reasonable ways a person might request information within the domain, but that would also reject any ill-formed sentences, on the grounds of both semantic and syntactic anomalies. Building such a tight grammar not only leads to a very low perplexity for the recognizer, but also virtually eliminates the problem of multiple parses, because all parses that are syntactically legitimate but semantically anomalous are weeded out. It has the added benefit of reducing computation time if the semantic constraints are integrated early in the parsing process, more or less at the first chance of resolution.</Paragraph> <Paragraph position="3"> We also wanted to maintain our criterion that a node should only have access to information locally available to it by default. That is, it should not be allowed to hunt back through the parse tree looking for a resolution of, for example, number agreement. By default, all nodes pass along the parameters passed to them by near relatives. The hard part is to come up with a compact representation that contains all the information necessary to carry out the constraints. In terms of syntax, there are patterns describing properties such as person, number, case, and determiner. Semantics are represented by patterns that include an automatic hierarchical inheritance of broader properties from more specific ones. Thus, for example, the semantic category Restaurant automatically acquires Building and Place as additional semantic features. In addition to the syntactic features, nodes also pass along semantic features that are automatically reset by designated nodes, such as terminal vocabulary entries.</Paragraph>
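<Paragraph> A minimal sketch of this hierarchical inheritance of semantic features, with a toy hierarchy assumed from the Restaurant example above (the names and encoding are illustrative, not TINA's internal representation):

# Each specific semantic category automatically carries its broader
# categories as features. The hierarchy below is a toy assumption based
# on the Restaurant example in the text.

PARENT = {
    "Restaurant": "Building",
    "Hotel": "Building",
    "Building": "Place",
    "Square": "Place",
}

def semantic_features(category):
    """Return the category plus all broader categories it inherits."""
    features = [category]
    while category in PARENT:
        category = PARENT[category]
        features.append(category)
    return features

print(semantic_features("Restaurant"))   # ['Restaurant', 'Building', 'Place']
</Paragraph>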
<Paragraph position="4"> The two slots Current-Focus and Float-Object, which are used for dealing with gaps, turned out also to be very useful for providing semantic constraint. In fact, we decided to take the approach of using only these two parameters for semantic filtering, to see whether that would be adequate. Their use in the gap mechanism is described elsewhere [7], but for clarification we will briefly review it here. Generators are nodes that enter their subparse into the Current-Focus slot. Activators move the Current-Focus into the Float-Object position for their children. Absorbers can accept the Float-Object as their subparse, in place of something from the input stream. The net result of this mechanism, aside from its intended use in gap resolution, is that it provides a second-order memory system for identifying the semantic categories of certain key content words in the history.</Paragraph> <Paragraph position="5"> As an example, consider the sentence &quot;(What street)_i is the Hyatt on (t_i)?&quot; The Q-Subject places &quot;what street&quot; into the Current-Focus slot, but this unit is activated to Float-Object status by the following Be-Question. The Subject node refills the now-empty Current-Focus with &quot;the Hyatt.&quot; The node A-Street, an absorber, can accept the Float-Object as a solution, but only if there is tight agreement in semantics, i.e., it requires the identifier Street. Thus a sentence such as &quot;(What restaurant)_i is the Hyatt on (t_i)?&quot; would fail on semantic grounds. Furthermore, the node On-Street imposes semantic restrictions on the Current-Focus.</Paragraph> <Paragraph position="6"> Thus the sentence &quot;(What street)_i is Cambridge on (t_i)?&quot; would fail because On-Street does not permit Region as the semantic category for the Current-Focus, &quot;Cambridge.&quot; The Current-Focus always contains the subject whenever a verb is proposed, and therefore it is easy to specify filtering constraints for subject-verb relationships. Thus, for example, the verb &quot;intersect&quot; demands that its subject be a Street, and the verb &quot;eat&quot; demands a Person. We have not yet incorporated probabilities into the semantic predictions, mainly because our domain is simple enough that they do not seem necessary.</Paragraph> <Paragraph position="7"> However, in principle probabilities could be added. Furthermore, these probabilities could be acquired automatically by parsing a collection of sentences and counting semantic co-occurrence patterns.</Paragraph> <Paragraph position="8"> An indication of how well our semantic restrictions are doing can be obtained by running the sentence generator with the semantic filters in place. Table 1 gives a list of five consecutively generated sentences from the VOYAGER domain. For the most part, generated sentences are now well-formed both semantically and syntactically.</Paragraph> <Paragraph position="9"> Table 1: Five consecutively generated sentences. 1. Do you know the most direct route to Broadway Avenue from here? 2. Can I get Chinese cuisine at Legal's? 3. I would like to walk to the subway stop from any hospital. 4. Locate a T-stop in Inman Square. 5. What kind of restaurant is located around Mount Auburn in Kendall Square of East Cambridge?</Paragraph>
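<Paragraph> The generator/activator/absorber mechanism described above can be sketched as follows. The control flow is a simplification made for illustration; a real parse interleaves these steps with the grammar network, and the category features are toy assumptions:

# Sketch of the Current-Focus / Float-Object gap mechanism with semantic
# agreement, following the "What street is the Hyatt on?" example above.

FEATURES = {
    "Street": ["Street"],
    "Region": ["Region"],
    "Hotel": ["Hotel", "Building", "Place"],
}

class State:
    def __init__(self):
        self.current_focus = None   # most recent key content word
        self.float_object = None    # a focus promoted for gap filling

def generate(state, phrase, category):
    """Generators (e.g. Q-Subject, Subject) fill the Current-Focus slot."""
    state.current_focus = (phrase, category)

def activate(state):
    """Activators (e.g. Be-Question) promote Current-Focus to Float-Object."""
    state.float_object, state.current_focus = state.current_focus, None

def absorb(state, required):
    """Absorbers (e.g. A-Street) accept the Float-Object as their subparse,
    but only under tight semantic agreement."""
    phrase, category = state.float_object
    if required not in FEATURES[category]:
        raise ValueError(f"semantic clash: {phrase} is not a {required}")
    return phrase

# "(What street)_i is the Hyatt on (t_i)?"
s = State()
generate(s, "what street", "Street")    # Q-Subject generates the focus
activate(s)                             # Be-Question activates it
generate(s, "the Hyatt", "Hotel")       # Subject refills Current-Focus
print(absorb(s, "Street"))              # A-Street absorbs the gap: succeeds
</Paragraph>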
</Section> </Section> <Section position="7" start_page="54" end_page="55" type="metho"> <SectionTitle> APPLICATION BACK-END </SectionTitle> <Paragraph position="0"> After an utterance has been processed by TINA, it is passed to an interface component which constructs a command function from the natural language parse. This function is subsequently passed to the back-end, where a response is generated. In this section, we will describe VOYAGER's current command framework, the interface between TINA and the back-end, and some of the discourse capabilities of the back-end.</Paragraph> <Paragraph position="1"> Command Framework We will illustrate the current command framework of VOYAGER by way of the simple example shown below: Query: Where is the nearest bank to MIT? Function: (LOCATE (NEAREST (BANK nil) (SCHOOL &quot;MIT&quot;))) LOCATE is an example of a major function that determines the primary action to be performed by the command. It shows the physical location of an object or set of objects on the map. Table 2 lists some of the major functions currently implemented in VOYAGER.</Paragraph> <Paragraph position="2"> Table 2: Actions of some of the major functions currently implemented in VOYAGER: locate a set of objects; describe a set of objects; identify a property of a set of objects; compute the distance between two objects; compute directions between two objects.</Paragraph> <Paragraph position="3"> Functions such as BANK and SCHOOL in the above example access the database to return an object or a set of objects. Such functions search for all entries that match the string pattern provided. When null arguments are provided, all possible candidates are returned from the database. Thus, for example, (SCHOOL &quot;MIT&quot;) and (BANK nil) will return the object MIT and the set of all known banks, respectively. Table 3 contains some examples of current data functions.</Paragraph> <Paragraph position="4"> Table 3: Examples of current data functions. STREET: return a set of streets. ADDRESS: return a set of addresses. INTERSECTION: return a set of intersections of two streets. SQUARE: return a set of squares.</Paragraph> <Paragraph position="5"> Finally, there are a number of functions in VOYAGER that act as filters, returning the subset of their input set that fulfills some requirement. The function (NEAREST x y), for example, returns the object in the set x that is closest to the object y. Table 4 contains several examples of filter functions.</Paragraph> <Paragraph position="6"> Table 4: Examples of filter functions: the subset of objects that are at a location; the subset of objects that are on a street; the subset of objects that serve a particular kind of food; the single object of a set that is nearest to a location (NEAREST).</Paragraph> <Paragraph position="7"> Note that these functions can be nested, so that quite complicated objects can be constructed easily. For example, &quot;the Chinese restaurant on Main Street nearest to the hotel in Harvard Square that is closest to City Hall&quot; would be represented by nesting several data and filter functions, as sketched below.</Paragraph>
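<Paragraph> The following is an illustrative reconstruction of such a nested command, not the paper's actual expression. NEAREST and the data functions of Table 3 appear above; the names RESTAURANT, HOTEL, PLACE, ON-STREET, SERVE, and AT are hypothetical stand-ins for the unnamed data and filter functions:

(NEAREST
  (SERVE (ON-STREET (RESTAURANT nil) (STREET &quot;Main&quot;)) &quot;Chinese&quot;)
  (NEAREST (AT (HOTEL nil) (SQUARE &quot;Harvard&quot;)) (PLACE &quot;City Hall&quot;)))

Reading from the inside out, the second argument finds the hotel in Harvard Square closest to City Hall, and the outer NEAREST selects the Chinese restaurant on Main Street closest to that hotel.</Paragraph>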
</Section> <Section position="8" start_page="55" end_page="56" type="metho"> <SectionTitle> Interface between TINA and Back-End </SectionTitle> <Paragraph position="0"> Our natural language component does not produce a logical form that is a separate entity from the parse itself. Instead, structural roles such as Subject and Dir-Object are an integral part of the parse tree. Furthermore, prepositional phrases are given case-frame-like identities such as From-Loc and In-Region [8]. Because of the availability of such semantic labels within the parse tree, the nested command sequence required by the back-end can be generated by a recursive procedure operating directly on the parse tree. The parses are transformed into a set of commands in a two-stage process. The first stage establishes the major function of the sentence, and the second stage fills in any arguments required by the major function. Each stage uses a list of entries that contain a parse pattern, a back-end function specification, and one or more argument specifications. In the first stage, each parse pattern corresponds to a sequence of one or more nodes in the parse tree and can specify a hierarchical constraint between certain nodes. Each argument specification corresponds to one or more entries in the second-stage list. In the second stage, a parse pattern can only be a single node, and each argument specification may be either one or more entries in the second-stage list, a terminal node, or a null value. Terminal nodes return a string from the parse tree, such as &quot;MIT&quot;. In most cases, each function specification corresponds to a single back-end function or a null value. When present, a function will be called with its associated arguments, such as (SCHOOL &quot;MIT&quot;). A function specification can also designate that the arguments are wrapped around each other. This mechanism is useful for generating nested filtering operations.</Paragraph> <Paragraph position="1"> When a parse is passed to the interface component, the patterns in the first-stage list are compared to the parse tree. Whenever a match is found, any argument specifications are passed one at a time to the second stage for resolution. If the argument specification is a single node, only the portion of the parse tree found below this node is processed, thus restricting the domain of the second-stage analysis. If there are multiple entries in an argument specification, the first one found is processed in the same way as a single node.</Paragraph> <Paragraph position="2"> In our previous example, the presence of the word &quot;where&quot; in a trace would result in the specification of LOCATE as the major function. In this case, the first node found in the associated argument specification is a Subject. The portion of the parse tree found below Subject is thus passed to the second stage. In the second-stage analysis, the Subject node would find an A-Place node in the subparse. Evaluation of this node would subsequently generate two arguments, (BANK nil) and (NEAREST (SCHOOL &quot;MIT&quot;)), which are wrapped around each other to produce the desired result.</Paragraph>
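<Paragraph> A minimal sketch of this two-stage transformation; the pattern tables and matching procedure below are simplified assumptions for illustration (the real interface matches node sequences and hierarchical constraints against the full parse tree):

# Sketch of the two-stage parse-to-command transformation. The tables are
# simplified assumptions: stage one names the major function, stage two
# maps single parse nodes to back-end functions; "wrap" splices one
# argument into the other to build nested filters.

FIRST_STAGE = {"where-trace": ("LOCATE", "A-Place")}

SECOND_STAGE = {
    "A-Place":        (None,      ["Place-Name", "Nearest-Phrase"], "wrap"),
    "Place-Name":     ("BANK",    [],              None),
    "Nearest-Phrase": ("NEAREST", ["School-Name"], None),
    "School-Name":    ("SCHOOL",  [],              None),
}

def evaluate(node, strings):
    """Recursively build the command s-expression for one parse node."""
    func, args, mode = SECOND_STAGE[node]
    parts = [evaluate(a, strings) for a in args]
    if mode == "wrap":
        # e.g. (BANK nil) wrapped into (NEAREST ...) yields
        # (NEAREST (BANK nil) (SCHOOL "MIT"))
        outer = parts[1]
        return (outer[0], parts[0]) + outer[1:]
    if not args:                       # data functions take a string or nil
        return (func, strings.get(node, "nil"))
    return (func,) + tuple(parts)

# "Where is the nearest bank to MIT?"
major, arg_node = FIRST_STAGE["where-trace"]
command = (major, evaluate(arg_node, {"School-Name": "MIT"}))
print(command)   # ('LOCATE', ('NEAREST', ('BANK', 'nil'), ('SCHOOL', 'MIT')))
</Paragraph>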
<Section position="1" start_page="56" end_page="56" type="sub_section"> <SectionTitle> Discourse Capabilities </SectionTitle> <Paragraph position="0"> The discourse capabilities of the current VOYAGER system are simplistic but nonetheless effective in handling the majority of the interactions within the designated task. Currently, anaphora resolution is handled in the back-end rather than in the natural language component. We will briefly describe how a discourse history is maintained, and how the system keeps track of incomplete requests, querying the user for more information as needed to fill in ambiguous material.</Paragraph> <Paragraph position="1"> Two slots are reserved for discourse history. The first slot refers to the location of the user, which can be set or referred to. The second slot refers to the most recently referenced set of objects. This second slot can contain a single object, a set of objects, or two separate objects in the case where the previous command involved a calculation involving both a source and a destination. Because of these history slots, user queries can include pronominal references such as &quot;What is their address?&quot; or &quot;How far is it from here?&quot;</Paragraph> <Paragraph position="2"> VOYAGER can also handle ambiguous queries, in which a function argument has either no value or multiple values when a single value is required. Examples of ambiguous queries are &quot;How far is a bank?&quot; when there are several banks, or &quot;How far is MIT?&quot; when no default user location has been specified. VOYAGER points out such ambiguity to the user by asking for specific clarification. The ambiguous command is also pushed onto a stack of incompletely specified commands. When the user provides additional information that is evaluated successfully, the top command on the stack is popped for reevaluation. If the additional information is not sufficient to resolve the original command, the command is pushed back onto the stack with the new information incorporated. In the case where the clarification is itself ambiguous, it too is pushed onto the stack until it can be clarified. A protection mechanism automatically clears the history stack whenever the user abandons that line of discussion before all ambiguous queries have been clarified. An example dialogue illustrating these clarification capabilities is given in Table 5.</Paragraph> </Section> </Section> </Paper>