
Stefan Ultes, University of Cambridge, Engineering Department, Trumpington Street, Cambridge, UK, e-mail: su259@cam.ac.uk

Towards Natural Goal-oriented Spoken Dialogue Interaction with Artificial Intelligent Systems

Stefan Ultes
Abstract

The skill of natural spoken interaction is crucial for artificial intelligent systems. To equip these systems with this skill, model-based statistical dialogue systems are essential. However, this goal is still out of reach. This paper argues that novel approaches to the dialogue model and the dialogue objective are needed in order to move a step closer towards natural spoken interaction.

Artificial intelligent systems are becoming prevalent in our everyday lives. Equipping such systems with the ability to hold a natural spoken interaction with humans is essential: machines and humans will soon collaborate closely on many tasks in which efficient communication is key, and natural spoken language is one of the most efficient means of communication available.

To implement a spoken dialogue system (SDS) that equips an artificial system with the skill of natural spoken communication, the system must be able to understand the user input in the context of the whole interaction and to produce adequate responses. Hence, it must be able to process complex dialogue structures and to behave in a way that the user perceives as natural.

To realise adequate system behaviour, machine learning algorithms are essential, as they decouple the behaviour from the abilities of a human designer. Reinforcement learning (RL) in particular, where the system learns to behave so as to optimise a given objective, allows the system to learn behaviour that is specific to human-machine interaction instead of imitating human-human behaviour¹. This is relevant as the expected behaviour of an artificial system might differ from human behaviour.

¹ Conventional approaches that analyse whether system behaviour is aligned with human expectations rely on user studies and data that are biased by human designers. Reinforcement learning, in contrast, allows the system to directly learn adequate behaviour that is aligned with human expectations by definition. This has already been shown to produce new and unexpected behaviour (silver2016mastering).
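As a minimal illustration of this paradigm, the sketch below shows tabular Q-learning applied to dialogue action selection. It is not the method of any cited system: the state is assumed to be a hashable summary of the belief state, the action set a list of system acts, and all constants are illustrative.

    import random
    from collections import defaultdict

    # Sketch of RL-based dialogue policy learning (tabular Q-learning).
    # 'state' is assumed to be a hashable belief-state summary; the
    # constants and interface are illustrative assumptions.
    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
    q_table = defaultdict(float)  # maps (state, action) to expected return

    def choose_act(state, system_acts):
        # Epsilon-greedy: explore occasionally, otherwise exploit.
        if random.random() < EPSILON:
            return random.choice(system_acts)
        return max(system_acts, key=lambda a: q_table[(state, a)])

    def update(state, act, reward, next_state, system_acts):
        # One-step temporal-difference update towards the observed reward.
        best_next = max(q_table[(next_state, a)] for a in system_acts)
        target = reward + GAMMA * best_next
        q_table[(state, act)] += ALPHA * (target - q_table[(state, act)])

The point is that the behaviour emerges from the reward signal alone, not from designer-written rules or from imitating human-human dialogue data.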

Even though there have been many recent contributions to the state of the art in goal-oriented spoken dialogue systems, the complexity of the dialogue structures they can handle has remained rather limited. Instead, recent work has proposed new RL algorithms (su-EtAl:2017:SIGDIAL; casanueva2017benchmarking), new state models (schulz2017frame; lee-stent:2016:SIGDIAL), or new system models (wen2017; wenLIDM17; serban2016building).

Recent statistical SDSs that use RL (young2013; lemon2012) are model-based: the dialogue model controls the complexity of the dialogue structures that the system can process². As is common for RL-based systems, the system behaviour is defined by the objective function. Thus, to increase the naturalness of the dialogues, new dialogue models are needed that allow for more complex dialogue structures, and dialogue objectives need to be investigated that allow for learning more natural system behaviour beyond task success.

² Model-free approaches like end-to-end generative networks (serban2016building; li2016) have interesting properties (e.g., they only need text data for training), but they still seem to be limited in terms of dialogue structure complexity (not linguistic complexity) in cases where content from a structured knowledge base needs to be incorporated. Approaches where incorporating this information is learned along with the system responses from dialogue data (eric-manning:2017:SIGDIAL) seem hard to scale. Furthermore, it seems counter-intuitive to learn this solely from dialogue data.
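For concreteness, the objective that such systems typically optimise can be sketched as a per-turn reward: a small penalty per turn plus a success bonus at the end of the dialogue. The constants below are a common choice in the literature, not a fixed standard.

    def turn_reward(dialogue_finished, task_success):
        # Sketch of the standard task-success objective: a per-turn
        # penalty encourages short dialogues, and a terminal bonus
        # rewards completing the user's task. Constants are illustrative.
        reward = -1  # per-turn penalty
        if dialogue_finished and task_success:
            reward += 20  # terminal success bonus
        return reward

Everything the policy learns is driven by this signal, which is why the choice of objective, discussed below, matters as much as the choice of model.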

The Dialogue Model   The role of the dialogue model is to define the structure and internal links of the dialogue state as well as the set of available system and user acts (i.e., the semantic representation of the user input utterance). It further defines the abstraction layer interfacing the background knowledge base. Most current models are built around domains, which encapsulate all relevant information belonging to a given topic (e.g., finding a restaurant or hotel) as a section of the dialogue state. However, the resulting flat, domain-centred state that is widely used makes it unintuitive to model more complex dialogue structures such as relations (e.g., 'I am looking for a hotel and a restaurant in the same area', 'I need a taxi to the station in time to catch the train'), multiple entities (e.g., two restaurants), or connections between sets of objects (e.g., adding several participants to a calendar entry). More realistic dialogue models are needed, along with RL algorithms that are able to deal with this added complexity.
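To make the contrast concrete, one possible alternative is sketched below: a relational dialogue state in which entities and cross-entity constraints are first-class objects rather than a flat, per-domain slot vector. All class and field names are hypothetical.

    from dataclasses import dataclass, field

    # Hypothetical relational dialogue state: entities and relations
    # are first-class objects instead of a flat domain-centred state.
    @dataclass
    class Entity:
        entity_id: str
        entity_type: str  # e.g. "hotel", "restaurant", "taxi"
        slots: dict = field(default_factory=dict)  # e.g. {"area": "north"}

    @dataclass
    class Relation:
        source: str      # entity_id of the first entity
        target: str      # entity_id of the second entity
        predicate: str   # e.g. "same_area", "arrives_before"

    @dataclass
    class DialogueState:
        entities: list = field(default_factory=list)
        relations: list = field(default_factory=list)

    # 'I am looking for a hotel and a restaurant in the same area':
    state = DialogueState(
        entities=[Entity("e1", "hotel"), Entity("e2", "restaurant")],
        relations=[Relation("e1", "e2", "same_area")],
    )

Such a representation can express the relation and multi-entity examples above directly, but it also enlarges the state space, which is precisely why matching RL algorithms are required.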

The Dialogue Objective   The dialogue objective is used in RL-based dialogue systems to guide the learning process by distinguishing good from bad system behaviour. The current standard objective for goal-oriented dialogue systems is task success. While task success is undoubtedly most important, it fails to capture aspects of natural interaction: there are many policies that lead to successful dialogues, but what is the subset of policies that lead to an interesting, satisfying, funny, natural, polite, etc. interaction? And what implications can be drawn for the system responses in order to achieve this? Novel methods are needed both to incorporate these objectives in a feasible way and to investigate their implications for the system response.
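One feasible way to incorporate such aspects, in the spirit of the automatically estimated objectives mentioned below, would be to blend task success with a learned per-turn estimate of interaction quality. The weighting scheme and the estimator interface in this sketch are assumptions for illustration only.

    def composite_reward(task_success, quality_estimate, w=0.5):
        # Illustrative composite objective: blend the task-success bonus
        # with an automatically estimated interaction-quality score in
        # [0, 1]. Both the weight w and the estimator are assumptions.
        success_term = 20 if task_success else 0
        quality_term = 20 * quality_estimate
        return (1 - w) * success_term + w * quality_term

The open questions are then how to obtain a reliable quality_estimate without costly human annotation, and how the resulting reward shapes the system's responses.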

Even though the model and the objective are both important for natural interaction, the dialogue model plays the core role: it not only defines the possible dialogue structures but also all the ways in which the system can express itself. And for finding feasible dialogue objectives within the increased complexity induced by the dialogue model, good starting points are methods that build on active learning (su2016acl) or on automatic estimation of the objective (ultes2017domain).