Embodied Conversational Interface Agents · Designing Embodied Conversational Interface Agents

Defining embodied conversational interface agents

To construct the definition of an embodied conversational interface agent, let’s begin by defining one word at a time.

What is an agent?

A software agent is a program that can act autonomously, carrying out tasks on behalf of a human actor (Gulz et al., 2011). Several definitions of software agents also require that an agent can adapt and learn, can be trained to respond in a certain way, and is personalized, or engineered specifically to help the user (Koda, 1996; Fineman, 2004). However, the basic definition is kept broad here so that specificity can be added later through conversational functionality and embodied representation. Under that broad definition these qualifications are unnecessary: software agents without functional artificial intelligence are also considered valid.

Just as a law enforcement agency facilitates enforcement of the law, and an advertising agency facilitates the creation of advertising, a software agent simply facilitates the use of software; the ways in which it does so fall outside the general definition. The most important feature of an agent is the ability to act independently. The word “bot,” derived from “robot” (which likewise denotes a machine with the ability to act autonomously), is used interchangeably with “agent” later in the text.

What is a conversational interface?

A conversational interface is any program that human users can interact with using text or speech (Niculescu et al., 2014). Several definitions of the conversational interface specify the means by which this is possible: natural language processing, machine learning, and artificial intelligence (Schuetzler et al., 2018). Again, this level of specificity about the inner workings of the software is unnecessary. Graphics, hyperlinks, and other multimedia content are also considered part of the implementation of a conversational interface, but they are not required; only text or speech input and output are.

The socially constructed aspects of conversation, such as the use of facial expressions and gestures, will be covered under the definition of embodiment. One exception that straddles the line between text content and embodied conversational interaction is the emoji. Until the debut of Apple’s Animoji with the iPhone X, which allows users to control an emoji with their own face (Emojipedia, 2017), the use of emoji faces in a conversational context was not construed as an embodiment of the emoji. Linguists have defined emoji as morpheme-like paralinguistic elements (Jibril & Abdullah, 2013) or as discourse particles that signify tone, and they are considered part of language.

What is embodiment?

Embodiment has been defined in many different ways across the sciences, but in this context the most useful definition comes from Cynthia Breazeal, who defined embodied interfaces in her study of sociable humanoid robots for the International Journal of Human-Computer Studies:

“In general, these systems can be either embodied (the human interacts with a robot or an animated avatar) or disembodied (the human interacts through speech or text entered at a keyboard). The embodied systems have the advantage of sending para-linguistic communication signals to a person, such as gesture, facial expression, intonation, gaze direction, or body posture.” (Breazeal, 2003, p. 120)

Put simply, an embodied interface is one in which a body or body parts are included in its representation.

Adding embodiment to a conversational interface allows for what is called multimodal communication. Multimodality includes the ability to input or output via different media (for example, speech and text), but also includes other modes of human-to-human communication like gesture, tone, facial expressions, and personality (Cohen & Oviatt, 1995).

Combining these definitions, an embodied conversational interface agent is any software program that acts autonomously, interacts via text or speech modality, and whose representation includes a body. Such agents include chatbots or chatterbots (Zdenek, 1999), pedagogical agents which aid in educational programs or take on instructional roles (Kim & Baylor, 2006), virtual human assistants (Gratch et al., 2004), as well as some software guides or wizards.
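The combined definition can be read as a conjunction of three criteria. The following is a minimal sketch of that reading, not an established taxonomy or tool: the class, the field names, and the two example interfaces are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceProfile:
    """Hypothetical description of an interface; field names are illustrative."""
    name: str
    acts_autonomously: bool        # the "agent" criterion
    uses_text_or_speech: bool      # the "conversational interface" criterion
    has_body_representation: bool  # the "embodied" criterion


def is_embodied_conversational_agent(profile: InterfaceProfile) -> bool:
    """Return True only when all three parts of the combined definition hold."""
    return (
        profile.acts_autonomously
        and profile.uses_text_or_speech
        and profile.has_body_representation
    )


# Two made-up examples, used only to exercise the check.
text_only_chatbot = InterfaceProfile("text-only chatbot", True, True, False)
animated_guide = InterfaceProfile("animated guide", True, True, True)

for profile in (text_only_chatbot, animated_guide):
    print(profile.name, is_embodied_conversational_agent(profile))
```

Under this reading, a purely text-based chatbot satisfies the agent and conversational criteria but not the embodiment criterion, so it falls outside the definition.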

There is enormous variation in the design of embodied representations, from 2D icon illustrations, to 3D animated avatars, to video captures of human actors, and some embodiments are not even human. The goal of this research is to establish a framework of best practices for the design of embodied conversational agents that enhance the user interface.

A skeuomorphic solution

Before delving into examples of real-life agents and the challenge of developing a general framework for the design of embodied conversational agents (henceforth often referred to as ECAs), it is necessary to clarify how they fit into the established paradigms of human-computer interaction.

The use of an embodied conversational agent is a skeuomorphic solution to the design problem of the human-computer interface. Skeuomorphism, as defined by Don Norman, is “the technical term for incorporating old, familiar ideas into new technologies, even though they no longer play a functional role” (Norman, 2013, p. 159). One of the best examples of this in computer technology is the icon commonly used in text editing programs for the “save” function, which is designed to look like a floppy disk. Floppy disks were originally used for data storage, but they became outdated within the first decade of the 21st century and are now very rarely used. Nonetheless, the symbol of the floppy disk remains iconic for the storage of data.

Other examples of skeuomorphic design in the Graphical User Interface include icons of paper files and folders used to represent the directory structure of information, or the image of a reel-to-reel video camera used to represent digital video functionality.

In the case of embodied conversational interfaces, the old, familiar idea they represent is human social interaction, whereas the actual function they are attempting to make familiar to the user is social interaction with a computer. If it is true that the social interface is a “universal interface” for human-computer interaction, as Reeves and Nass have claimed (Breazeal, 2003), then enhancing this effect by providing an explicitly social, embodied agent to interact with should make the interface even easier to use.

Gulz et al. assert in their 2011 study of conversational agents that the visual dimension “is a powerful means for engendering affordances for social interaction,” and “contributes strongly to the experience of a character with a personality... rather than simply a computer artifact” (pp. 130-131). In a similar study the same year, Amy Baylor concludes that “the agent’s appearance is the most important design feature, as it dictates the learner’s perception of the agent as a virtual social model” (p. 291).

These studies build on significant evidence that humans can be socially influenced by software agents, and that the visual representation of the agent is key in enhancing this effect. Baylor alone cites seven different previous studies drawing this conclusion in her 2009 paper “Promoting motivation with virtual agents and avatars: role of visual presence and appearance,” (p. 3559) before confirming in her own research that the visual presence of an agent is critical for motivational and affective outcomes.

These affective outcomes, the arousal of users’ emotions, are some of the most interesting effects of embodied conversational agent design. They add to the previously studied expressions of sociality toward computers, which were elicited even without an embodied representation. Some of these effects include:

increased naturalness of communication (Schuetzler et al., 2018)
greater perceptions of agent credibility (Baylor & Ryu, 2003)
deeper learning and higher motivation (Kim & Baylor, 2006)
mitigation of user frustration (Baylor, 2009)

So far, these affective outcomes are positive, but after a brief overview of embodied conversational agents developed in research contexts and commercial applications, we’ll look more in depth at how complicated designing agents for social interaction and emotional affect can be.

Example agents from research contexts

Figure 1. ELIZA adapted for the Commodore PET in 1997

ELIZA was possibly the first conversational agent. Developed at MIT in 1966 by Joseph Weizenbaum, ELIZA was programmed to mimic interaction with a psychotherapist (Wortzel, 2007). As the adaptations from 1997 (Figure 1) and 1977 (Figure 2) show, ELIZA had no embodiment and was a purely text-based conversational interface. However, this did not stop Weizenbaum’s staff from developing close relationships with the bot during therapeutic chat sessions. In his notes, Weizenbaum wrote, “What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people” (Wortzel, 2007).

Figure 2. ELIZA adapted in 1977

ELIZA is considered the great-grandmother of modern chatbots. Many of these, such as A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) and Mitsuku, were developed to compete for the Loebner Prize, an annual prize awarded to the artificial intelligence program most able to resemble a human through a chat interface. A.L.I.C.E. and the Artificial Intelligence Markup Language (AIML) were originally developed by Richard Wallace in 1995, and AIML now forms the foundation for the programming of many modern chatbots, including those produced by Pandorabots, Inc., a leading platform for commercial chatbot development.

Mitsuku was originally developed using AIML by Steve Worswick in 2006, and a version of Mitsuku’s code base is now licensed as a Pandorabots product. Pandorabots touts Mitsuku as “widely considered the world’s best, most humanlike, conversational chatbot,” and the bot won the Loebner Prize in 2013, 2016, and 2017.

Mitsuku is an interesting case of a research agent turning into a commercial product, and undergoing a stylistic evolution over time. Through this first case study, we’ll expose many of the categories of design features that will be analyzed in depth to develop a general framework for the design of conversational interface agents.

There have been many other embodied conversational agents developed in research contexts that are worth mentioning before moving on to other ECAs available as commercial products.

Several agents have been developed by MIT laboratories, including LAURA, an agent integrated with the MIT FitTrack application and meant to motivate users to exercise (Gama et al., 2011). Other agents have been developed for various MIT Media Lab applications, including Newt, an agent for a personalized news filtering system; Maxims, an e-mail assistant; and an unnamed calendar agent used to schedule meetings (Koda, 1996).

Figure 3. REA, the Real Estate Agent

REA (Figure 3) was developed in the MIT Media Lab to inhabit the role of a real estate agent in a virtual environment. This agent was used in several studies by Justine Cassell to examine the effects of multimodal interfaces with an agent designed to use body language and nonverbal conversational cues such as gaze and facial expressions to facilitate conversation (Breazeal, 2003; Cassell, 2000; Cassell, 2001).

Figure 4. MACK, the Media Lab Autonomous Conversation Kiosk

Another notable bot to come out of the MIT Media Lab was MACK (Figure 4), the Media Lab Autonomous Conversation Kiosk, an agent situated in the lobby of a lab building in front of a map. MACK was able to answer questions about the labs and give directions using gestures and pointing out features on the map (Huang, 2010; Cassell, 2001).

Similar bots have been implemented by several museums to guide and entertain visitors. Two agents named August and Pixie were installed in Swedish culture and telecommunications museums, and an agent named Sgt. Blackwell was installed in several contemporary art museums in the U.S. (Huang, 2010). Perhaps the most well-known of these virtual docents is Max, a guide agent created in 2004 for the Heinz Nixdorf MuseumsForum, a computer museum in Germany. Reportedly, Max was quite successful in interacting socially by engaging museum visitors in conversations about the exhibitions, museum information, and other topics (Kopp et al., 2005).

Other agents in research contexts were developed to target specific groups, such as MAY, designed to assist teenagers in self-reflection, SAM, created to engage children in a mixed-reality play space, and one called the Senior Companion, developed to help elderly people annotate photographs with stories from their lives (Cassell, 2001; Gama et al., 2011). Agents have also been developed to inhabit other social roles, including Greta, a doctor agent implemented as a 3D talking head that could give patients information about drug prescriptions (Huang, 2010) and Steve, an agent designed by the Information Sciences Institute at the University of Southern California to train naval recruits to operate equipment on a virtual ship (Breazeal, 2003).

Case study: Mitsuku

Figure 5. The original Mitsuku chat interface

Figure 6. The original Mitsuku avatar

Figures 5 and 6 show the version of Mitsuku’s avatar that appears on the original Mitsuku website by Steve Worswick. The original version of Mitsuku’s avatar is a quite amateur-looking illustration of a teenage girl in an outfit reminiscent of a Japanese schoolgirl uniform. This representation is emblematic of research bot design, in that not much care has been taken to present a polished, or even consistent, design to represent the bot’s embodiment.

However, this lack of professional design suits the original bot’s intended audience: Turing-testing Loebner Prize judges, who will never see an avatar, and lonely people on the internet. (As her original home page reads, “You need never feel lonely again! Mitsuku is your new virtual friend and is here 24 hours a day just to talk to you.”) The rough design may even be inviting. As evidence of users’ affinity for the original embodiment, one need look no further than Worswick’s gallery of Mitsuku fan art submitted to his site and the Mitsuku Facebook page, which contains over 50 works at a similar artistic skill level.

This unintended benefit, a community built around the low-fidelity representation of the most recent top-ranked artificial intelligence, did not survive the design’s iteration into a commercial product.

Figure 7. The new Mitsuku avatar

The new version of Mitsuku advertised on the Pandorabots website, shown in Figure 7, is a significant upgrade in terms of graphics but a downgrade in terms of likability. The 3D figure now has an edgy side-shaved haircut with bangs and a purple ponytail, and sports a skull earring with a modern, layered tank top and shirt outfit. The new avatar wears a neutral or slightly smiling expression when not speaking, but in general seems much less friendly than the 2D illustration, which is always smiling and betrays less social awareness in clothing style.

Figure 8. Mitsuku’s interface on Twitch.tv

Mitsuku is now available to chat on virtually every modern messaging platform, as well as on a 24/7 Twitch stream called @Mitsuku_IRL (Figure 8). One of the more interesting features of this development is how Mitsuku is situated. Only one visual signifier of the original program has carried over into the new design: the background, which for both the original avatar and the new avatar in its native apps gives a clue as to the bot-like nature of the program. The original avatar conveys this with a pattern of 1s and 0s or a circuit board pattern, and the new one has a circuit board pattern in the shape of a heart floating behind Mitsuku.

On the Mitsuku_IRL stream, however, Mitsuku is digitally inserted atop a moving pan of Google Maps locations that can be controlled by the people in Twitch chat. The uncanniness of both images is magnified, particularly because Mitsuku remains in half-body view atop every background and is never fully seen to inhabit the space.

In this context, Mitsuku’s design is similar to, but less convincing than, the computer-generated Instagram “influencer” Lil Miquela (Figure 9), who is often posed in front of real places and interacting with real objects and brands in her photos. However, Mitsuku’s shortcoming here as a realistically integrated virtual human is understandable given the constraints of the avatar’s animation and having to adapt it to many different platforms.

Figure 9. Lil Miquela, via Instagram.com/lilmiquela

What we can take away from this analysis of Mitsuku’s design:

1. More realistic avatars are not always better – there is very little existing fan art of the new Mitsuku design, and the situation of the CGI figure within real locations is both unconvincing and unnecessary.

2. Visual signifiers of roboticness (the binary and circuit board patterns) feel necessary somewhere in the interface, particularly when the avatar has a human embodiment, even if its level of realism is very low; this will come into play as a design element later, when deciding between human embodiments and alternative body types.

More realistic ≠ better

Example agents from commercial products

Both Apple and Microsoft have developed conversational agents in the past to facilitate the use of their operating systems or other software programs. In the early 1990s, developments in technology that allowed for a larger visual range in the GUI prompted the implementation of programs like Apple Guides, Apple Knowledge Navigator, and the Microsoft Persona Project, all of which used embodied characters to guide the user through their functionality (Brahnam, Karanikas, & Weaver, 2011). “Phil,” the character created for Apple Knowledge Navigator (Figure 10), was represented as both a human and a cartoon figure, with a signature bow tie as part of his uniform so that he would be recognizable across interface implementations (Koda, 1996). The bow tie also signifies his role as an assistant, similar to a butler or a waiter.

Figure 10. Phil, from Apple Knowledge Navigator

One of the most recognizable conversational interfaces was Microsoft Bob, produced in 1995 as part of Microsoft Home. Inspired by the Navigator interface by Packard Bell (Swartz, 2003), Bob used the representation of an office within the computer as a design metaphor. Various cartoon characters within the office, such as the dog shown in Figure 11, helped the user interact with programs and computer functions.

Figure 11. Microsoft Bob

In 1997, Microsoft included their cartoon agent technology in the Microsoft Office programs by integrating it with the Answer Wizard functions, creating the infamous Microsoft Office Assistant.

Figure 12. The Microsoft Office Assistants

Several characters were included in the Microsoft Office Assistant program (as seen in Figure 12), including a wizard (literalizing the metaphor of the Answer Wizard), human characters resembling both Einstein and Shakespeare, two dogs most likely descended from the cartoon dog from Bob, two cats, a puzzle vaguely resembling the Microsoft logo, a planet Earth, an alien spaceship, a smiling cartoon face, and a bipedal, three-dimensionally rendered robot. Iterations of several of these character designs (and a few that never saw the light of day, such as the genie) can be seen in patents filed by Microsoft from 1994 to 1998 (Figure 13, via McCracken, 2009).

The default character, a paper clip with human facial features and an articulated wire body, named Clippit but colloquially known as Clippy, has become widely known as one of the most annoying conversational interfaces ever developed. Clippy will be the subject of our next case study.