5 tips for better multimodal design

Aug 29, 2022


Title image courtesy of Voiceflow, where this was originally published

It’s a multimodal world out there

As humans, we absorb information in a variety of ways. Our 5 physical senses (sight, hearing, smell, taste, and touch) all work together to build our understanding of the world and the realities we face on a daily basis. They are also pretty handy when it comes to reinforcing certain behaviors or discouraging us from repeating mistakes. A classic example of this is associating an open flame (its shape, color, heat, and crackle) with the danger zone. Put your hand too close and you might burn yourself! The more we experience these combinations of inputs (our physical senses reporting on that open flame to our brain), the more real that experience becomes, and the less we try to set ourselves on fire.

Most of our interactions follow that same pattern. We get information about our environment through different inputs and then produce an output, or reaction, to that information. Multimodal design applies that same essential principle of how people navigate their world to designing interactive experiences between machines and humans. It’s still UX design, but instead of designing for one modality, you have to juggle 2 or more, while also taking into consideration the pros and cons of each.

What is multimodal conversation design?

We know what single-modality interactions feel like. Appreciating a fine work of art and quickly scanning a new drive-thru menu for the tastiest picture have one thing in common: you’re generally relying on only one of the physical senses, sight. Using only one of our senses doesn’t take away from our life experience. It’s something we do all the time. But if there were an option to Instagram “like” something in real life, wouldn’t we want to tap that button?

Multimodal design aims to make life easier by giving users natural shortcuts through an interaction via different modalities. The aim of a multimodal conversational experience is to present and receive information in an intuitive way. Basically, you design multimodal conversational flows the way they would happen in a real-life conversation. This means switching across contexts, capabilities, or even devices.

💡 A clarification: multimodal vs. multichannel

Multimodal interactions may include features that allow a user to start an experience in one communication channel and continue in another. For example, smart TVs allow for easy viewing of photo galleries or videos from a conversation that may have started in a voice-only medium (e.g. a smart speaker). The golden rule is that not all multimodal experiences are multichannel, but multichannel experiences are multimodal.

While there is no hard and fast method of designing multimodal experiences, the concept itself has been at least thought about in the tech industry since the ’90s. Intentionally designed experiences work, while others can immediately come off as distracting or overwhelming.

Here are 5 rules of thumb I use to design multimodal conversational experiences:

  1. When in doubt, draw it out
  2. Context, context, context
  3. Eliminate the competition
  4. Stay consistent, everywhere
  5. Prioritize accessibility

When in doubt, draw it out

Multimodal conversational experiences have a lot of moving parts, and it can be easy to lose sight of the journey as a whole. Writing prompts without taking the accompanying visuals into consideration can be kind of like buying a lottery ticket and leaving it unscratched: your design will have all the potential in the world, but no one’s gonna take it if you try to cash it in. An easy way to keep the conversational flow in perspective is to make flowcharts with the dialog and wireframes (mockups that show the general placement of content on a screen) or finalized visuals side by side. Others in the industry also like sketching storyboards (like they do in Hollywood!). This method not only gives you insight into the timing of what a user might see or hear, but also lets you place the conversation within its context.

Context, context, context

Context in multimodal design is essential. Knowing where your users are while they progress through different steps of the journey can reveal pain points or opportunities in your design, especially if the journey requires switching between devices (for more information on this, I highly recommend Cheryl Platz’s book). It can help you understand the advantages of revealing or taking in information through one modality over another. Generally, you don’t want your user to say sensitive information out loud if they have the option to type it into a form on a screen. But there’s more to it than that. Keeping yourself informed of your user’s context can even stop you from forcing features on your users that are simply not feasible, like how some ecommerce websites have a checkout timer that empties your cart if you don’t get your payment information in before the time runs out. Consider your user in their environment to keep the conversation relevant.
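
If it helps to make a context rule like this concrete, you can write it down as a tiny decision function. The sketch below is only an illustration: `UserContext`, its fields, the `"defer"` outcome, and `choose_input_modality` are all hypothetical names I made up for this example, not part of Voiceflow or any other platform, and a real product would weigh many more signals.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    """Hypothetical snapshot of where the user is in the journey."""
    has_screen: bool      # a screen is available on the current device
    has_microphone: bool  # the device can take spoken input
    in_public: bool       # assumption: other people may overhear the user


def choose_input_modality(ctx: UserContext, sensitive: bool) -> str:
    """Pick how to ask for the next piece of information.

    Returns "screen_form", "voice", or "defer" (ask again later).
    """
    # Prefer typing for sensitive data whenever a screen exists.
    if sensitive and ctx.has_screen:
        return "screen_form"
    # Sensitive data, no screen, possible eavesdroppers: don't force it now.
    if sensitive and ctx.in_public:
        return "defer"
    # Otherwise fall back to whatever input the device offers.
    if ctx.has_microphone:
        return "voice"
    if ctx.has_screen:
        return "screen_form"
    return "defer"
```

For a user on a smart display, `choose_input_modality(UserContext(True, True, False), sensitive=True)` picks the on-screen form even though voice is available, which is the preference described above.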

Eliminate the competition

When everything in a design competes for a user’s attention, nothing wins. Too many screaming elements in a journey can actually push the feel of the UX into gimmicky territory. As designers, we need to remember that very few things about our designed products actually live “rent free” in our users’ minds. Keep in mind that there is an order to how people absorb information. In visual design, UI elements are intentionally placed according to how the human eye would scan the page. For English speakers, that usually means content will be scanned from top to bottom, left to right. In multimodal design, where the audible, visible, and tactile compete for attention, it’s less straightforward. While it is true that humans can read words faster than they can hear them, you may not always want to prioritize speed. Depending on the use case, you may choose to use audio for branding, to leave an impression and set the tone more clearly than repeatedly making users absorb the visual and written elements on a screen (this is one of the reasons why sonic branding is having its moment). Each modality has its advantages. The key lies in emphasizing one at a time.

Stay consistent, everywhere

Visuals usually come with text. Very rarely do you see a website out in the wild with NO text at all. That’s just bananas, and super inaccessible. All copy on a screen should be purposefully written in a way that aligns with the journey objective. Say, for example, you design a bot personality to be youthful, fun, and curious. If you insert microcopy into your design that is more professional in tone (think “Submit”, “Next”, or “Forgot password”), it would completely misalign with the personality you designed! Think of the copy as an extension of the conversation. It’s like when a memorable video comes up in conversation and you just have to search for it on your phone to show to a friend. You don’t change your entire personality to share something with your friend. If you did, your friend might start considering hanging out with you less. Sloppy, inconsistent copy can breed distrust in users. Stay consistent, even in the little details.

Prioritize accessibility

The power of multimodal experiences is huge. Technology in this form can now reduce friction for more people and include more people in the conversation. However, as we all learned with Spider-Man, with great power comes great responsibility (I’m sure Uncle Ben had multimodal designers in mind when he said that). We need to start talking more about accessibility at the design stage. It sets the tone not only for how conversations are shaped, but also for how easy they are to use. As the lead conversation designer for Voice Compass®, it’s my responsibility to make sure we’re asking the right questions at the beginning of our design process, and to find out if we’re accidentally excluding users with our product. Part of it includes knowing how to write alt text for our images, or realizing that visual loading screens don’t need alt text but do require an aural notification. The bigger picture is determining whether all end users of a multimodal journey can easily complete a given task from A to B without major roadblocks. Design can be delightful, yes, but it should also be usable for all kinds of people.
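
Checks like the alt text and loading screen ones are easy to forget, so it can be worth turning them into a quick review pass over a journey. Here is a rough sketch of what that could look like. Everything in it is an assumption for illustration: `Step`, `VisualElement`, the `"image"` and `"loading_screen"` labels, and `accessibility_gaps` are hypothetical and do not come from any real design tool, and two checks are nowhere near a full accessibility audit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualElement:
    kind: str           # hypothetical labels: "image" or "loading_screen"
    alt_text: str = ""  # text alternative read out by assistive technology


@dataclass
class Step:
    name: str
    visuals: List[VisualElement] = field(default_factory=list)
    audio_cue: str = ""  # spoken prompt or sound played during this step


def accessibility_gaps(steps: List[Step]) -> List[str]:
    """Return a warning for each image without alt text and each
    loading screen that has no accompanying audio cue."""
    gaps = []
    for step in steps:
        for visual in step.visuals:
            if visual.kind == "image" and not visual.alt_text.strip():
                gaps.append(f"{step.name}: image is missing alt text")
            if visual.kind == "loading_screen" and not step.audio_cue.strip():
                gaps.append(f"{step.name}: loading screen has no aural notification")
    return gaps
```

Running it over a journey gives you a plain list of warnings to talk through with the team, and an empty list for a journey that passes both checks.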

A multimodal future

I believe we’re heading toward a future where multimodal is the norm and no longer a buzzword. Beyond the smart home, I envision a smart city, with smart neighborhoods that let you opt in to automated experiences based on more inputs than the ones we see used today. Even now, it’s an exciting time to work in multimodal design. Designers may not have all the answers or agree on what makes great multimodal design, but there are now enough resources to get started. Voiceflow has allowed for multimodal prototyping, and more industry professionals are sharing their learnings, like the new Conversations with Things book by Diana Deibel and Rebecca Evanhoe. It’s a multimodal world out there and it finally seems like our technology is catching up.

Originally published on June 23, 2021 on voiceflow.com