<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1016">
  <Title>Development of the HRL Route Navigation Dialogue System</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. DEVELOPMENT PHASES
</SectionTitle>
    <Paragraph position="0"> One can identify four distinct subproblems which must be solved for a navigation system: 1) the natural language navigation interface, 2) street name recognition, 3) the natural language destination entry interface given street name recognition, and 4) the map database interface. We have partitioned the problem and have phased our development to progressively implement solutions with increasing complexity.</Paragraph>
    <Paragraph position="1"> Navigation system implementation is complicated by the potential of having a very large street name vocabulary with many unusual and uncommon pronunciations with significant variations across speakers. The appropriate name space is dynamic since it depends on the location of the vehicle.</Paragraph>
    <Paragraph position="2"> Our initial system does not accept queries with proper street names. In addition, we assume separate destination entry and route planning systems, and that one or more routes have been loaded into the navigation system. The system relies on open dialogue to resolve the directions at any stage of the journey and may or may not use the Global Positioning System (GPS) to determine the progress along the route. By implementing this system first we could concentrate on the dialogue aspects of the navigation problem and also establish a baseline with which to compare our other implementations.</Paragraph>
    <Paragraph position="3"> In the second phase we include a limited set of street names as part of the language model and lexicons. Initially we are using a predefined set of names with hand tuning of the pronunciations.</Paragraph>
    <Paragraph position="4"> Additional research is required to solve the street name recognition problem generally and automatically. We assume in-vehicle GPS and use a map matching system to determine the vehicle's position and whether it is on-route. This phase includes development of the natural language components for destination entry and also broadens the scope of the navigation queries to include questions with and about street names. More distant plans include on-road route replanning, responding to requests for specific street names or points of interest along the route, and providing traffic information and workarounds.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3. IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> Our implementation is based on the Galaxy-II system [6] from the Massachusetts Institute of Technology (MIT), which is the baseline for the Communicator program of the Defense Advanced Research Projects Agency (DARPA). The architecture consists of a hub client that communicates, using a standard protocol, with a number of servers as shown in Figure 1. Each server generally implements a key system function, such as Speech Recognition.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Speech Recognition
</SectionTitle>
      <Paragraph position="0"> We use the latest MIT SUMMIT recognizer [8], which uses weighted finite-state transducers for the lexical access search. We have also &amp;quot;plugged in&amp;quot; alternate recognizers such as the Microsoft Speech SDK recognizer and the Sphinx [3] speech recognizer, available as open source code from Carnegie Mellon University.</Paragraph>
      <Paragraph position="1"> We are in the process of developing a large database of in-vehicle utterances collected in various car models under a wide range of road and other background noise conditions. This data collection is being carried out in two phases; the first is complete and the second is underway. The first phase yields limited speech data, while the second will yield substantial speech data (appropriate for training acoustic models to represent in-vehicle noise conditions and for testing recognition engines), which will become available through our partners in this collection effort, CSLR at the University of Colorado, Boulder [4]. In the meantime we are using the MIT JUPITER acoustic models. Their performance is acceptable for our language and dialogue model development, but we refrain from presenting detailed recognizer results here since they would not fairly reflect optimized recognizer performance.</Paragraph>
      <Paragraph position="2"> Our vocabulary consists of about 400 words without street names.</Paragraph>
      <Paragraph position="3"> We have an additional 600 street names gleaned from the Los Angeles area where we do much of our system evaluation.</Paragraph>
      <Paragraph position="4"> Baseforms for the vocabulary are derived from the PRONLEX dictionary from the Linguistic Data Consortium at the University of Pennsylvania. Extensive hand editing is needed, especially for the street names. The MIT rule set is used to produce word graphs for the alternate pronunciation forms. We have derived a language model from a set of utterances that were initially generated based on our best guess of the query space. As evaluation evolves, we modify the utterance list and retrain the language model. The language model uses classes and includes both bigram and trigram models.</Paragraph>
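      <Paragraph> As a concrete illustration of why classes help, the minimal sketch below pools n-gram counts across street names. The class inventory, tokenization, boundary tokens, and toy utterances are our own illustration, not the system's actual model:

```python
# Minimal sketch of class-based bigram counting. The CLASSES map,
# BOS/EOS boundary tokens, and toy utterances are illustrative only.
from collections import Counter

CLASSES = {"sepulveda": "STREET_NAME", "wilshire": "STREET_NAME"}

def classify(tokens):
    # Replace street-name tokens with their class label so that
    # counts are pooled across all members of the class.
    return [CLASSES.get(t, t) for t in tokens]

def bigram_counts(utterances):
    counts = Counter()
    for utt in utterances:
        toks = ["BOS"] + classify(utt.lower().split()) + ["EOS"]
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
    return counts

counts = bigram_counts(["turn onto Sepulveda", "turn onto Wilshire"])
print(counts[("onto", "STREET_NAME")])  # both names pool into one bigram: 2
```

Because the bigram is estimated for the class rather than each individual name, new street names can be handled by updating only the class membership list.</Paragraph>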
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Application Interface
</SectionTitle>
      <Paragraph position="0"> We are building the application interface in several phases.</Paragraph>
      <Paragraph position="1"> Initially we are only answering queries about turns and distances during navigation. We obtain the database in two steps. First, we access a commercial map database using standard text I/O for destination entry and route planning. This produces a detailed set of instructions that includes many short segments such as on- and off-ramps. We filter this and rewrite the data to provide a set of natural driving instructions suitable for verbal communication.</Paragraph>
      <Paragraph position="2"> The result is a flat database, such as the one shown in Figure 2.</Paragraph>
      <Paragraph position="3"> This is loaded into the system and used to formulate answers to the route queries. In the example in Figure 2 the estimated driving time is 45 minutes. Each row is a segment of the trip. The first and second columns code right, left, straight, and compass direction information. The third column is the segment length in miles and the last is the segment name.</Paragraph>
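      <Paragraph> A minimal sketch of such a flat route database and a turn query against it is shown below. The field names, line layout, and demo rows are hypothetical, since the actual file format of Figure 2 is not reproduced here:

```python
# Sketch of a flat route database with the four columns described:
# turn direction, compass direction, length in miles, segment name.
# Field names and the demo rows are our own illustration.
from dataclasses import dataclass

@dataclass
class Segment:
    turn: str       # "right", "left", or "straight"
    compass: str    # compass direction code
    miles: float    # segment length in miles
    name: str       # segment (street) name

def load_route(lines):
    """Parse one trip segment per line: turn compass miles name..."""
    route = []
    for line in lines:
        turn, compass, miles, name = line.split(None, 3)
        route.append(Segment(turn, compass, float(miles), name))
    return route

def describe_turn(route, ordinal):
    """Answer a query like 'what is my Nth turn?' (1-based)."""
    seg = route[ordinal - 1]
    return "Turn %s onto %s after %.1f miles" % (seg.turn, seg.name, seg.miles)

demo = load_route([
    "straight E 2.3 I-10 Freeway",
    "right S 1.1 Sepulveda Boulevard",
])
print(describe_turn(demo, 2))  # Turn right onto Sepulveda Boulevard after 1.1 miles
```

Keeping each trip segment as one row makes ordinal queries (second turn, third turn) simple indexed lookups.</Paragraph>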
      <Paragraph position="4"> [Figure 2 caption: guidance instructions for the route between HRL Laboratories and the Los Angeles airport.] A sample dialogue is shown in Figure 3, which illustrates the kind of responses the system can generate from a database such as that shown above, given navigation queries of the sort shown; this sample was drawn from our phase I user-system data logs. [Figure 3 caption: sample queries and the responses derived by the dialogue manager based on the database of Figure 2.]</Paragraph>
      <Paragraph position="5"> Off-line construction of the global navigation database is not unrealistic since it could be done, at least in the near term, by a service organization such as OnStar from General Motors (GM).</Paragraph>
      <Paragraph position="6"> However, as navigation systems become widely deployed, users will expect destination entry, including real-time route re-planning, to be an integral part of the system. We are developing a direct voice interface to the commercial map database that includes destination entry, route planning, and map matching using GPS data to determine whether the vehicle is on-route.</Paragraph>
      <Paragraph position="7"> During the destination entry phase street names need to be robustly recognized. We are currently working with a subset of street names in the Los Angeles area preloaded in the recognizer and language models. It is untenable to keep all of the street names in Los Angeles loaded in the recognizer simultaneously (there are around 16,000, including 8,000 base names), so we are developing a method for dynamic loading of map names local to the vehicle position, which we will report on in the near future. We have experimented with using a subset of street names as a filter list, and as a lookup list based on spelling the first few letters, to try to resolve the requested destination. If this fails, or if the trip is outside of the area from which names are loaded, we rely on more complete spelling to determine the destination. The origin for the route plan is generally implied, since most of the time it is determined by the GPS position of the vehicle. Once the destination is determined, it is straightforward to continuously replan the route based on the current vehicle position and thereby provide remedial instruction if the driver departs from the route plan.</Paragraph>
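      <Paragraph> The spelled-prefix lookup can be sketched as follows. This is a hypothetical illustration of the idea only; the names, the function, and the disambiguation policy are ours, not the system's:

```python
# Hypothetical sketch of resolving a destination street from the first
# few spelled letters against the locally loaded name subset.
def resolve_spelling(loaded_names, letters):
    """letters: the letters the driver has spelled so far, e.g. ['s','e','p']."""
    prefix = "".join(letters).lower()
    matches = [n for n in loaded_names if n.lower().startswith(prefix)]
    if len(matches) == 1:
        return matches[0]    # unambiguous match: destination resolved
    return matches           # empty or ambiguous: ask for more letters

loaded = ["Sepulveda Blvd", "Sunset Blvd", "Santa Monica Blvd"]
print(resolve_spelling(loaded, ["s", "e", "p"]))  # Sepulveda Blvd
```

When the returned list is still ambiguous (or empty because the name is outside the loaded area), the dialogue falls back to requesting more complete spelling, as described above.</Paragraph>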
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 NL Analysis and Generation
</SectionTitle>
      <Paragraph position="0"> The core NLP components in our system are a TINA [5] grammar, context tracking mechanisms, and the GENESIS language generation module. The TINA grammar includes both syntactic and semantic elements and we try to extract as much information as possible from the parse. The information is coded in a hierarchical frame (Figure 4a) as well as a flat key-value pair (Figure 4b). In addition to handcrafting this grammar, a set of rules was also developed for the TINA inheritance mechanism.</Paragraph>
      <Paragraph position="1"> These rules are applied during context tracking, after the parse, to incorporate information from the dialog history into phrases such as &amp;quot;and after that&amp;quot; and &amp;quot;how about my second turn,&amp;quot; and are also used to incorporate modifications that result from the dialogue. As noted, we use the MIT GENESIS server for language generation. This module is also rule-driven, and we developed the lexicon, templates, and rewrite rules needed for the three ways we use GENESIS. First, we extract the key-value pairs (e.g. Figure 4b) from the TINA parse frame; the key values help control the dialogue management and provide easy access to the variable values. Second, we use GENESIS to produce the English reply string that is spoken by the synthesizer. For example, the frame in Figure 4c in conjunction with our rules generates the sentence &amp;quot;From Pacific Coast Highway turn straight onto East I-10 freeway.&amp;quot; Lastly, GENESIS is used to produce an SQL query string for database access. Templates and rewrite rules determine which form the output from GENESIS will take. Technically these three uses (key-value, reply string, and SQL) are just generation of different languages.</Paragraph>
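      <Paragraph> The idea of generating different target languages from one flat frame can be sketched with simple templates. The template strings and frame keys below are hypothetical; the real GENESIS module is driven by its own lexicon, templates, and rewrite rules:

```python
# Illustrative sketch of generating two "languages" (an English reply
# and an SQL query) from flat key-value frames via templates.
# Template strings and frame keys are our own, not GENESIS's.
REPLY_TMPL = "From {from_street} turn {direction} onto {onto_street}"
SQL_TMPL = "SELECT length, name FROM route WHERE ord = {ord}"

def generate(frame, template):
    # Fill the template slots from the frame's key-value pairs.
    return template.format(**frame)

frame = {
    "from_street": "Pacific Coast Highway",
    "direction": "straight",
    "onto_street": "East I-10 freeway",
}
print(generate(frame, REPLY_TMPL))
print(generate({"ord": 2}, SQL_TMPL))
```

Choosing a different template set for the same frame yields the reply string, the SQL query, or the key-value form, which is the sense in which the three uses are just generation of different languages.</Paragraph>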
    </Section>
    <Section position="4" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.4 Dialogue Management
</SectionTitle>
      <Paragraph position="0"> We have developed servers for dialogue management and for controlling the application interface for database query. The hub architecture supports the use of a control table to direct which server function is called, which is especially useful for dialogue management. The control table is specified by a set of rules using logic and arithmetic operations on the key-value pairs. A well-designed set of rules makes it far easier to visualize the flow and debug the dialogue logic. For example, when a control rule such as: Clause &amp;quot;locate&amp;quot; !:from --&gt; turn_from_here fires on the key-value pairs (Figure 4b), the hub calls the turn manager function &amp;quot;turn_from_here&amp;quot;. In this simplified case, we assume that if there is no &amp;quot;:from&amp;quot; key, the request is to locate an object (i.e. &amp;quot;turn&amp;quot;) relative to the vehicle's current position. The function then needs only to extract the value of the key &amp;quot;:ORD&amp;quot; and look up the data for the second turn in the database of Figure 2. This data is written into the response frame, here called &amp;quot;speak_turn&amp;quot; and shown in Figure 4c. GENESIS uses this frame to generate the English language reply that is spoken by the synthesizer, as described above.</Paragraph>
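      <Paragraph> The firing of such a control rule can be sketched as predicate dispatch over the key-value pairs. The encoding of rules as Python predicates and the handler body are our own illustration; the actual control table uses its own rule syntax:

```python
# Sketch of control-table dispatch over key-value pairs, illustrating
# the rule: Clause "locate" !:from --> turn_from_here
# Rule encoding and handler body are hypothetical.
def locate_without_from(kv):
    # Fires when the clause is "locate" and no ":from" key is present.
    return kv.get("clause") == "locate" and ":from" not in kv

RULES = [(locate_without_from, "turn_from_here")]

def dispatch(kv, handlers):
    # Try each rule in order; call the handler named by the first match.
    for predicate, handler_name in RULES:
        if predicate(kv):
            return handlers[handler_name](kv)
    return None

def turn_from_here(kv):
    # Look up the ":ORD"-th turn relative to the current position.
    return "lookup turn %d from current position" % kv[":ORD"]

print(dispatch({"clause": "locate", ":ORD": 2},
               {"turn_from_here": turn_from_here}))
```

Because the mapping from conditions to server functions lives in one rule table, the dialogue flow can be inspected and debugged without tracing through handler code.</Paragraph>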
      <Paragraph position="1"> In the examples shown here we communicate with the database from the dialogue manager by downloading a flat database such as that of Figure 2, perhaps via a data link to an off-board service organization such as OnStar. In cases where we access databases directly, we use a separate server for this function. Generally, communications between the dialogue manager and database servers are routed via the Hub.</Paragraph>
      <Paragraph position="2"> Our dialogue manager has been designed to use GPS data when available (in which case GPS coordinates would also be a part of the database) or to use location information based on current odometer readings provided as input by drivers when GPS is not available. We use this latter method for demonstrating the system in a desktop setting, though we have also recently completed a utility for employing maps generated by our commercial navigation database, graphically displaying a driver's progress along an imaginary route. We are now employing this tool as part of our current iteration of system testing and revision.</Paragraph>
      <Paragraph position="3"> 3.4.1 Referential ambiguities in driver queries
The driver can query to determine turn or distance information relative to the current vehicle position, relative to another turn or reference point in the database, or as an absolute reference into the route plan stored in the database. We have devoted considerable effort to dealing with ambiguities which may arise as a result of the different ways users may be conceptualizing the route (that is, in absolute or relative terms), as well as the driver being at different points in the route, and at different points in the progression of a discourse segment. Queries such as &amp;quot;what's next?&amp;quot; can be ambiguous. Determining the correct interpretation requires consideration of the discourse history and the user's circumstances. For example, in the following dialog sequence (drawn from our data; U: user, S: system), there are at least two possible interpretations for &amp;quot;what is next?&amp;quot; in the third turn. Notice that this query could be requesting information about the next turn from the driver's current position (i.e. the immediately approaching turn), or it could be requesting information about the third turn from the driver's current position, that is, the next turn from the most recently referred to turn. We will henceforth refer to these two interpretations as next-from-here and next-after-that, respectively.</Paragraph>
      <Paragraph position="4"> The factor which appears to have the most influence on which interpretation is given to this utterance originates neither in the utterance itself nor in the preceding dialog, but is purely circumstantial, namely, how much time has passed since the last utterance. Our assumption has been that there is a kind of timedependency factor in coherent discourses: while &amp;quot;what is next&amp;quot; is still within the scope of the preceding discourse context, it may (most likely will) be given the next-after-that interpretation. But after a certain length of time has elapsed, &amp;quot;what is next&amp;quot; cannot be interpreted as referring to some previously uttered instruction, but only as referring to the driver's current position. If we think of this in terms of the user's frame of reference for talking about their real or imagined location (we'll refer to this as the FROM value), then we could characterize this phenomenon as the value of FROM being reset to HERE in the absence of immediate discourse context.</Paragraph>
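      <Paragraph> The time-dependency heuristic can be sketched as follows. The threshold value and function names are our own illustration; the paper does not specify a concrete reset time:

```python
# Sketch of the temporal "reset" heuristic. Within the discourse
# window, "what is next" gets the next-after-that reading; once the
# window lapses, FROM resets to HERE and next-from-here applies.
# The 10-second threshold is an illustrative value, not the system's.
RESET_SECONDS = 10.0

def interpret_what_is_next(seconds_since_last_utterance,
                           last_referred_turn, current_turn):
    if seconds_since_last_utterance > RESET_SECONDS:
        return current_turn + 1        # next-from-here (FROM reset to HERE)
    return last_referred_turn + 1      # next-after-that

print(interpret_what_is_next(3.0, last_referred_turn=2, current_turn=1))   # 3
print(interpret_what_is_next(30.0, last_referred_turn=2, current_turn=1))  # 2
```

The only input beyond the discourse state is elapsed time, reflecting the observation that the deciding factor is circumstantial rather than linguistic.</Paragraph>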
      <Paragraph position="5"> Interpretations of numbered turn references (e.g. &amp;quot;what's my second turn&amp;quot;) can also vary depending on another purely circumstantial factor, namely whether the driver is querying the system while preparing to begin the trip, or after she has begun driving. Some drivers will want to preview trip information before beginning to drive, and in this situation, interpretation of certain query types may differ from interpretation done during the trip. When the driver is querying the system before beginning to drive, she is more likely to conceive of and speak of the route in an absolute sense (cf. [7]). That is, the driver may conceive of the route as a fixed plan, wherein each turn and segment have a unique and constant order in a sequence. When conceiving of the route in this way, one may refer to turns by number in the route, rather than by number relative to current position. (There is at least one further possible interpretation of &amp;quot;what is next?&amp;quot; here, at least if the proper prosodic features are present. If heavy emphasis is placed on &amp;quot;what,&amp;quot; the query has a quasi echo-question interpretation, indicating either that the user did not hear, or else is surprised at the prior instruction and is asking for clarification or repetition.)</Paragraph>
      <Paragraph position="6"> Although we have yet to gather real user data bearing on this question, our intuition is that once the trip is underway, especially once any significant distance has been traveled, if users do use numbered turn references at all, they will be much more likely to use them relative to their current position.</Paragraph>
      <Paragraph position="7"> Queries of this type are, for practical purposes, only ambiguous once the user has begun the trip, but prior to the absolute numbered turn. Drivers are very unlikely to be asking about the second turn in the route once they have passed the second turn.</Paragraph>
      <Paragraph position="8"> Moreover, since people will generally only keep track of turn numbers in the range of 1-3 (give or take 1), numbered turn references will only be ambiguous prior to the third or fourth turn in the route (nobody is likely to be asking &amp;quot;what is the eighth turn in the route&amp;quot;). What is more, if the user asks a numbered turn query before beginning the trip, the system response will be the same, since the relative and absolute turn numbers will at that point coincide. Thus, the only time a true ambiguity must be handled by the system is after the trip is underway and before the fourth turn. It is perhaps worth noting that if one views the overall query interpretation problem as determining whether the user is asking a question relative to their current position or to some other position, then the absolute/relative distinction is just a special case of that.</Paragraph>
      <Paragraph position="9"> We have gone on the assumption that there are a substantial number of these ambiguous queries, not only of the &amp;quot;what's next&amp;quot; type, but also among numbered turn requests and a class of distance queries [1]. However, we have now carried out an experiment in which subjects interpreted such queries in a controlled setting, and the results indicate there is far less ambiguity in truly felicitous driver utterances than we originally hypothesized [2]. There will probably be some genuinely ambiguous queries, especially for a system which is not capable of detecting prosodic cues; however, we are now of the opinion that they will not comprise a significant percentage of driver queries.</Paragraph>
      <Paragraph position="10"> For the system which we describe herein, however, the control logic for queries of the type under discussion includes consideration of the temporal &amp;quot;reset&amp;quot; threshold discussed above, as indicated in the following table:</Paragraph>
      <Paragraph position="12"> The table is to be read column-by-column. Thus, the first column tells us that if we have a query with a numbered turn reference and a phrase which is semantically equivalent to &amp;quot;from here&amp;quot; (which is also the default), then the instruction number which will be requested (via SQL query) from the database is current+number.</Paragraph>
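      <Paragraph> The logic described for the table's first column can be sketched as follows. The &amp;quot;start&amp;quot; branch for absolute references is our assumption about the remaining columns, which are not reproduced here:

```python
# Sketch of the first-column control logic: a numbered turn reference
# with FROM semantically equivalent to "here" (the default) requests
# instruction current+number from the database. The "start" case for
# absolute references is our assumption, not taken from the table.
def requested_instruction(number, from_value, current):
    if from_value == "here":       # first column: relative (default)
        return current + number
    if from_value == "start":      # assumed: absolute reference into the plan
        return number
    raise ValueError("unhandled FROM value: %s" % from_value)

print(requested_instruction(2, "here", current=3))  # 5
```

The returned instruction number would then be substituted into the SQL query sent to the route database.</Paragraph>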
    </Section>
  </Section>
</Paper>