Speak, Machine: Human-Computer Interaction and the Literalization of the Conversational Interface Metaphor
Human-Computer Interaction and interface metaphors
The study of Human-Computer Interaction, often shortened to HCI, is a discipline concerned with designing, engineering, and optimizing the elements that facilitate computer use. These elements include physical hardware and ergonomic interfaces (such as keyboards, displays, and other devices that accommodate human hands and sensorium) as well as software interfaces. As Ivan Hybs points out in “Beyond the Interface: A Phenomenological View of Computer Systems Design,” the question of interaction addresses not only the design of the device and its accoutrements, but also the context of the device and the user, human practices involving computers, and how humans understand technology (Hybs, 1996).
This design discipline has advanced in step with the development of personal computers. According to the timeline provided by Jonathan Grudin in his 1990 paper, “The Computer Reaches Out: The Historical Continuity of Interface Design,” the focus of human-computer interaction shifted as the locus of control moved further away from the internal mechanisms of computers. In the early years of computing (the 1950s and 60s), when computer users were limited to scientists and programmers, users interacted directly with computer hardware and had to be familiar with the mathematical intricacies of data storage and with programming in machine language. As higher-level programming languages and environments developed throughout the 1960s and 1970s, the need to interact directly with the hardware was greatly reduced. When personal computers became available to non-programmers in the 1980s and 1990s, the use of displays and keyboards further abstracted the user from the computer’s internal hardware, allowing them to control the computer and carry out tasks with no knowledge of its inner workings.
As the site where the user controls the computer moved further away from its internal hardware, the discipline of HCI developed to accommodate these ergonomic factors and design affordances. Grudin predicted in 1990 that in the future, the user interface would extend past the eyes and fingers and into the mind, as well as outward from the primary user and into the social and work environment, a development he called “groupware.”
This aspect of HCI, its emphasis on devices’ compatibility with human psychology as well as physiology, led to the development of visual and abstract metaphors for how people interact with computer data. There are three basic computer interaction metaphors that are still in use today: direct manipulation (“Data is a physical object”), navigation (“Data is in space”), and human interaction, or communication (“Computers are people”) (Fineman, 2004).
Direct manipulation and navigation metaphors in the Graphical User Interface
The first computer to have a Graphical User Interface (or GUI) was the Alto, developed at Xerox’s Palo Alto Research Center (PARC) between 1972 and 1973. Using a video display about the size of an 8½-by-11-inch sheet of paper, the user of the Alto could draw pictures or display text on the screen, and use a mouse to control a pointer for interacting with objects. These objects included buttons, menus, and icons to launch and manipulate programs, and windows to allow the user to control and monitor multiple programs running simultaneously (Petzold, 2000).
It was the Alto computer that Steve Jobs witnessed at Xerox PARC in 1979, and which inspired him to implement a similar GUI in early Apple computers. And although many different types of personal computers have been developed in the intervening years, the fundamentals of the Graphical User Interface remain the same.
The Graphical User Interface, which allows the user to select, move, and manipulate objects within the computer through pointing, dragging and dropping, provides the “direct manipulation” metaphor, which positions the human as in control of a passive collection of objects which can be interacted with directly through their graphical representation (Fineman, 2004).
For example, instead of typing the command “rm somefile.txt” into the command line to delete a text file, the user can simply drag the image that represents the file (most likely depicted as an icon of a sheet of paper, denoting a text file) into the image of a trash bin.
The greatest advantage of the Graphical User Interface and the direct manipulation metaphor is that they simplify computer use for non-programmers. The visual metaphors presented in the GUI are intuitive (when designed well), and allow the user to quickly adapt to different programs using the same actions (such as pointing, clicking, and dragging). The GUI also lays out visually, via menus and buttons, all of the options available to the user, which helps prevent input errors (Cohen & Oviatt, 1995).
The other metaphor that the Graphical User Interface affords is the navigation metaphor, which is often useful when referring to data coming through the internet. Web “sites” where data is accessed, and the “location” of files within the computer’s directories are examples of this metaphor.
Cadence Kinsey has pointed out that the navigation metaphor also allows us to consider the user’s position in relation to technology, whether they interface with tools such as the screen and keyboard, mouse, trackpad, or stylus, or if they are directly manipulating an immersive, three-dimensional simulation of the computer system (now possible via virtual and augmented reality technology). In the 2014 article “Matrices of Embodiment: Rethinking Binary and the Politics of Digital Representation,” Kinsey writes, “Conceiving of the GUI as a space has allowed us to try to secure our own position in relation to the technology, to be able to say ‘I am here.’ In the GUI environment, the subject is constructed in and through the spatial metaphorics of computer vision.” (p. 905)
For the purposes of analyzing human-computer interaction, it is important to recognize that the users’ awareness of themselves and awareness of the computer they’re interacting with can be a design feature or flaw. What has been called “the perversity of computers” (Hybs, 1996) is that the computer is continuously “present-at-hand” in the Heideggerian sense: the computer as a tool does not fade from the user’s awareness during its use as a hammer does in the act of driving in a nail. The direct manipulation metaphor of interaction makes the computer even more visible to the user, while other metaphors, such as communication, allow it to temporarily disappear. The complexity of computers, and their ability to carry out tasks of their own accord (when commanded, or, seemingly, with a mind of their own), are the basis for the human interaction, or communication metaphor.
Dialogue metaphors of interaction
From the earliest programming “languages,” dialogue has been a fundamental metaphor for how humans interact with computers. Human-computer interaction as a dialogue, conversation, or communication has been called the “initial constitutive metaphor” of human-computer interaction (Brahnam, Karanikas, & Weaver, 2011), and as a consequence of this metaphor, the computer is positioned as an entity with enough agency to carry on a conversation.
This metaphoric approach of interaction as a dialogue attempts to create interactions that parallel human conversation without being literal conversations, and without literally implying that “computers are people.” Nonetheless, the language surrounding computers and their inner workings has historically been, and continues to be, anthropomorphic in nature.
Even in the 1930s, before the punch-card-programmed machines we now recognize as computers, a “computer” referred to a person who performed computations by hand. In the 1940s, most human computers were female, and the time it took them to crunch the numbers was measured colloquially by mathematicians and physicists in “girl-years” or “kilo-girls” (Brahnam, Karanikas, & Weaver, 2011), much as we refer to compile time and runtime for programs today.
Just as young women were employed to perform calculations by hand during World War II, today’s computers are employed to perform calculations, answer questions, remember information, and assist in many tasks with enough agency and complexity that we characterize them as individuals, calling them “smart” or “helpful” (or, if they fail at their tasks, “stupid”).
As Benjamin Fineman points out in “Computers as people: human interaction metaphors in human-computer interaction”: “When we say a computer is ‘stupid,’ we usually don’t mean that it has limited processing power, but rather that it doesn’t understand our intentions or behaves inappropriately. Conversely, a ‘smart’ computer seems to anticipate and react appropriately to our needs. Computers can appear socially intelligent without elaborate or complex artificial intelligence systems since they only need to display the appropriate behavior, not understand it.” (Fineman, 2004, p. 13)
Fineman goes on to explain that this appearance of social intelligence matches what Erving Goffman calls a “front”: “the set of signals – both appearance and actions – that others use to determine our social status, mood, intentions, and so on.” Likening this to Don Norman’s concept of “affordances,” which signal the availability of actions to be performed with or on an object, he demonstrates that computers signal social attentiveness: for example, using a flashing cursor to mark where text input is awaited, or popping up an alert message atop the active program to convey urgency.
When the user responds to these cues from the computer, they do not feel they are interacting with the programmer who created these affordances; the death of the author is total, and the illusion that the computer is communicating of its own accord prevails.
Both Fineman and Ivan Hybs have pointed out the literalization of this metaphor in the rise of the Personal Digital Assistant (PDA), such as the Palm Pilot. Here the computer is no longer referred to as a machine, but as a mobile companion whose role is to assist in communications. When a computer interface is intuitive, we call it “user-friendly,” ascribing a persona and social role to the device based on how easy it is to use. The social inscription of computers as “friendly,” “helpful,” and “obedient” is essential to how we are taught to use and think about computers: not only as tools, but as workers.
Literal conversation (Natural Language Interfaces)
Building on the ability to metaphorically converse with computers, the metaphor is literalized in natural language interfaces, which use spoken or written language to operate the computer. Embedded in the term “natural language interface” is the implication that conversing aloud through speech is a “natural” process, a human function that comes easily and invisibly (Phan, 2017). As computer use has grown to include a broad spectrum of users with varying levels of expertise, the search for the most intuitive and easy-to-use interfaces continues.
While it may be easier for many humans to interact in this manner, it is not a natural interface for computers. Programming computers to understand human speech as input data, and to respond with human-sounding speech as output, is a challenge in both directions. As explained by Charles Petzold in Code: The Hidden Language of Computer Hardware and Software, one solution for the output is demonstrated by information systems accessed over the telephone, where human voices are pre-recorded and broken into sentence fragments, words, and numbers, which the computer plays back according to input on the telephone’s number pad. A slightly more complicated solution involves converting ASCII text to waveforms using a dictionary or pronunciation algorithms, and using pre-recorded phonemes to form whole words and phrases.
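The dictionary-based, concatenative approach can be sketched roughly as follows. Everything here is an illustrative stand-in: the pronunciation table covers only two words, and the phoneme “clips” are placeholder sample lists rather than real recorded audio.

```python
# Hypothetical pronunciation dictionary: each word maps to a phoneme sequence.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Stand-ins for pre-recorded phoneme audio. A real system would store
# waveform samples recorded from a human speaker; here each "clip" is
# a 100-sample placeholder list.
PHONEME_CLIPS = {
    "HH": [0.1] * 100, "AH": [0.2] * 100, "L": [0.3] * 100, "OW": [0.4] * 100,
    "W": [0.5] * 100, "ER": [0.6] * 100, "D": [0.7] * 100,
}

def synthesize(text):
    """Look up each word's phonemes and splice the pre-recorded clips together."""
    samples = []
    for word in text.lower().split():
        for phoneme in PRONUNCIATIONS[word]:
            samples.extend(PHONEME_CLIPS[phoneme])
    return samples

audio = synthesize("hello world")  # 8 phonemes x 100 samples = 800 samples
```

A production concatenative synthesizer would additionally smooth the seams between clips and fall back on pronunciation rules for words missing from the dictionary, but the core mechanism is this lookup-and-splice.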
Speech recognition and programming computers to understand natural language input, Petzold writes, is a problem “in the realm of the field of artificial intelligence,” and requires rigorous training of the algorithm; but in the 18 years since Code was published, this technology has come a long way. Project Common Voice, launched by Mozilla in June of 2017, seeks to democratize the development of natural language interfaces by creating an open source data set that currently contains over 1.5 billion contributions by English speakers, with data for 45 other languages in the process of being collected (Mozilla, 2018). Even more recently, a demonstration by Google of a telephone scheduling system called Duplex in May of 2018 shocked the general public with how humanlike computers can now sound (Leviathan & Matias, 2018).
In their 1995 paper “The Role of Voice Input for Human-Machine Communication,” Cohen and Oviatt hypothesized many situations in which natural language interfaces could be used, including telephone systems. Others include situations where the user’s hands or eyes are busy, such as in manufacturing environments, while piloting a vehicle, or in a medical diagnostic context. They also observed the decreasing size of portable computers, and hypothesized that as screen real estate diminished, devices which were both computer and telephone (what we call smartphones today) would increasingly be controlled by voice.
Cohen and Oviatt also pointed out the advantages of natural language interfaces for disabled users: the deaf would have access to instantaneous speech-to-text conversion, and the blind to text-to-speech. They also noted that speech recognition could be used by the motorically impaired to control home appliances, mobility technology, and prostheses.
Reading is in general faster than listening: Don Norman cites an average reading rate of 300 words per minute (and skimming rates of up to thousands of words per minute), compared to an average listening rate of 60 words per minute (Norman, 2013, p. 267). Even so, natural language interfaces have been shown to increase efficiency in other ways.
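Norman’s figures imply roughly a five-fold gap in throughput. As a back-of-the-envelope illustration (the 1,500-word article length is an arbitrary example):

```python
# Rates cited by Norman (2013): reading vs. listening, in words per minute.
READING_WPM = 300
LISTENING_WPM = 60

def minutes_to_consume(word_count, words_per_minute):
    """Time, in minutes, to take in a document at a given rate."""
    return word_count / words_per_minute

article = 1500  # a hypothetical 1,500-word article
print(minutes_to_consume(article, READING_WPM))    # 5.0 minutes to read
print(minutes_to_consume(article, LISTENING_WPM))  # 25.0 minutes to listen
```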
Early studies by Cohen and Oviatt on natural language interfaces showed that, of ten different communication modalities, speech was the most effective among teams in a problem-solving exercise. Single-word spoken commands were found to be as fast as clicking a mouse or typing a single-letter command when interacting with certain programs. Circuit designers were able to accomplish 25% more tasks when they could use spoken commands in addition to a keyboard and mouse interface (Cohen & Oviatt, 1995, pp. 9923-9924). Later studies by Richard E. Mayer and Roxana Moreno, in 1998 and 1999, also found speech superior to visual information in studies of cognitive psychology and multimedia education (Baylor, 2011).
Even when speech is found to be less efficient, it is often preferred simply because it is a more expressive and natural mode of communication and requires very little training. If the goal of human-computer interaction is to make interfaces easier to use or “friendlier,” then transferring the use of a skill most people have cultivated their entire lives is one of the most logical choices with many clear benefits.
However, tapping into the social parts of the human brain to literalize the communication metaphor of human-computer interaction with natural language interfaces is not without some unintended consequences.
Psychosocial effects of conversing with computers: Computers Are Social Actors (CASA) and Actor-Network Theory
Actor-Network Theory is a method of thought that privileges non-human objects as actors (Moore, 2012), and it fits naturally with the conception of computers as objects with agency. Indeed, it seems much easier to conceive of computers as actors than most other objects, because of their capacity to communicate and to “think” for us (even though we know objectively that a computer’s thoughts are simply the result of electrical circuits and programming). Actor-Network Theory extends the communication metaphor of human-computer interaction to assert a degree of intelligence or agency within the computer, implied by the tasks and responsibilities delegated to it.
This unconscious bias to privilege computers’ intelligence above that of other objects was first explicitly explored by Clifford Nass and Jonathan Steuer in 1993, who found that four characteristics strongly encourage a social response to an object: the use of language, a human-sounding voice, interactivity (“defined as how much the system uses prior input to determine its subsequent behavior” (Swartz, 2003, p. 13)), and the conferral of a social role on the object. In the same study, they found that people respond to different voices coming from the same computer as different social actors, and to the same voice coming from different computers as the same social actor (Wang et al., 2007). Another study, a year later, by Nass, Steuer, Henriksen, and Dryer, found that only “minimal social cues” were required to produce this effect in computer-literate individuals.
Clifford Nass and Byron Reeves produced an expanded version of this theory, the Computers Are Social Actors (“CASA”) theory, in their 1996 paper “The media equation: how people treat computers, television, and new media like real people and places” (Fineman, 2004).
What does it mean to treat a computer as a social actor? Social presence, as defined by Short et al. in The social psychology of telecommunications, includes verbal and nonverbal cues of behavior (Baylor, 2011). Specifically, it was found through subsequent research that social responses to computers included:
- differing interactions between similar and dissimilar personalities, implying computers’ possession of personalities (Nass et al. 1995)
- teamwork and interdependency (Nass, Fogg, and Moon, 1996)
- gender stereotyping (Nass, Moon, and Green, 1997)
- response to flattery (Fogg & Nass, 1997)
- attribution of responsibility (Moon & Nass, 1998)
- enacting social norms of politeness (Nass, Moon, and Carney, 1999)
- reciprocal behavior, i.e. information exchange and turn-taking according to social norms (Fogg & Nass, 1997; Moon, 2000)
Most importantly, these behaviors were observed to be entirely unconscious: Nass and Reeves found that participants questioned afterwards explicitly denied exhibiting social behaviors towards computers, yet they did exhibit them, regardless of their level of technological proficiency. Even if users are fully aware they are interacting with a machine, if that machine possesses a human voice, language fluency, and a social role, and appears to respond in a minimally socially acceptable way, they will treat it as a human.
Youngme Moon found that this tendency emerges “whether the representation of the computer is the screen, a voice, or an agent” (Moon, 2000). So while a visual representation of an agent may not be strictly necessary to encourage these social responses, adding one will inevitably produce them. The intention behind providing an embodied agent in conversational interface design is to enhance this subconscious effect.