Sunday, May 06, 2018

Alexa meetup: Designing Multimodal Skills

Yesterday, I attended a meetup on designing multimodal skills for Alexa, and in this post I'll share some of the interesting pointers from the presentation and discussion.



-> We are in the era of Voice UI


While terminals were the primary mode of interacting with computers in the 1970s, systems have evolved over the years to support different interaction paradigms - from GUI, to the Web, to Mobile. In the same way, the 2010s are the era of the Voice User Interface (VUI).

Voice comes naturally to us; we have been using it for thousands of years to interact with one another. Voice is the next big computing platform.

-> Cloud enables experiences that were not possible earlier

While sentient chat systems and bots have been imagined for a long time, earlier efforts fell short because of the limited computing power available on the edge device.


For example, designing an AI assistant like Alexa broadly involves several complex steps:

  • Speech Recognition
  • Machine Learning based Natural Language Understanding (NLU)
    • converting the user's utterance to an intent
  • Text to Speech

This was not possible when all the processing had to be done on the device itself. Cloud computing enables AI like Alexa to flourish by offloading the heavy computation from the end device to the cloud.
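
To make this concrete, here is a rough Python sketch (my own, with a made-up GetSpaceFactIntent) of the kind of trimmed-down request a skill receives once Alexa's cloud has turned the spoken utterance into an intent - the real payload also carries version, session and context information.

    # A trimmed-down sketch of the JSON an Alexa skill receives after the cloud
    # has run speech recognition and NLU on something like "give me a space fact".
    # "GetSpaceFactIntent" is a made-up intent name used purely for illustration.
    sample_request = {
        "request": {
            "type": "IntentRequest",           # NLU resolved the utterance to an intent
            "intent": {
                "name": "GetSpaceFactIntent",  # the intent the skill has declared
                "slots": {},                   # slot values extracted from the utterance
            },
            "locale": "en-US",
        }
    }

    # The skill only ever sees this structured intent, never the raw audio -
    # all the heavy lifting happens in the cloud before the request arrives.
    print(sample_request["request"]["intent"]["name"])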

-> Multimodal experiences are the way forward

Multimodal experiences refer to applications where users can experience the skill through multiple modes. For example, on an Echo Spot, users get both a voice and a visual experience.


While Alexa apps should always be voice first, experiences can now also be augmented with the help of visual cues.
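
As a rough illustration (my own sketch, not from the talk), a single Alexa response can carry both the spoken output and a simple visual card; the field names below follow the standard Alexa response format, while the fact itself is just a placeholder.

    # A minimal voice-first response that also carries a visual element.
    # outputSpeech is what every device speaks; the card is extra visual
    # context shown on devices with a screen (and in the Alexa app).
    response = {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "PlainText",
                "text": "A day on Venus is longer than a year on Venus.",
            },
            "card": {
                "type": "Simple",
                "title": "Space Fact",
                "content": "A day on Venus is longer than its year.",
            },
            "shouldEndSession": True,
        },
    }

The spoken text stands on its own, and the card only augments it - which is really the essence of designing voice first.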

-> The introduction of multimodal approaches calls for new design principles

While Alexa is not yet suited for long lists of items or complex nesting between them, there are some general design guidelines that can be followed:
  • design voice first - you just don't know whether the user will have a visual feed or not (see the sketch after this list)
  • do not nest actions within list items - it makes for poor voice UX
  • choose images that look great on all devices - while the Echo Spot has a circular screen, the Echo Show has a rectangular one
  • use font overrides sparingly, and markup in meaningful ways
  • a good way to design better voice UX is to write the interactions down and read them out as a roleplay - if it doesn't sound right, change it
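
To show what the first guideline can look like in code, here is a small sketch of mine (not from the session): it only attaches visual content when the requesting device reports a display, using the supportedInterfaces section of the request. The helper names and the card contents are made up.

    def has_display(alexa_request: dict) -> bool:
        """Best-effort check for a screen, based on the device's supportedInterfaces."""
        interfaces = (
            alexa_request.get("context", {})
            .get("System", {})
            .get("device", {})
            .get("supportedInterfaces", {})
        )
        return "Display" in interfaces

    def build_response(alexa_request: dict, fact: str) -> dict:
        # The spoken text is always present - voice first.
        response = {
            "outputSpeech": {"type": "PlainText", "text": fact},
            "shouldEndSession": True,
        }
        # Visual content is only an augmentation for devices that can show it.
        if has_display(alexa_request):
            response["card"] = {"type": "Simple", "title": "Space Fact", "content": fact}
        return {"version": "1.0", "response": response}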


The presentation was followed by a hands-on Alexa development session, where attendees created a fresh Alexa skill for space facts and deployed a pre-coded Lambda to the cloud from the serverless repository. This was a standard JSON-in, JSON-out kind of session, which helped familiarise participants with the Alexa developer portal and the Lambda deployment process.
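
For anyone who wants a feel for what such a JSON-in, JSON-out Lambda looks like, here is a minimal Python sketch of mine - the pre-coded sample from the serverless repository is more complete, and the facts and behaviour below are placeholders.

    import random

    # Placeholder facts; the workshop's pre-coded sample ships with its own list.
    SPACE_FACTS = [
        "A year on Mercury is just 88 days long.",
        "Jupiter has the shortest day of all the planets.",
    ]

    def lambda_handler(event, context):
        """Entry point that AWS Lambda invokes with the Alexa request JSON ('event')."""
        request_type = event["request"]["type"]

        if request_type == "LaunchRequest":
            text = "Welcome to Space Facts. Ask me for a space fact."
            end_session = False
        else:
            # In this sketch, every intent just gets a random fact back.
            text = random.choice(SPACE_FACTS)
            end_session = True

        # JSON out: Alexa speaks outputSpeech.text back to the user.
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": end_session,
            },
        }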

The meetup ended with a presentation by team YellowAnt, who demoed Alexa's notifications feature, currently in public beta. YellowAnt is a chatops startup, and the gist of the demo was that these notifications can be used to ping DevOps users about system updates (downtime, completed deployments, and so on).

However, given that Alexa is a voice first ecosystem, it was very interesting to hear Alexa pronounce lengthy text and URLs character by character, and try to read multiple notifications one after the other. All of this would have made sense as strings in email or chat notifications, but it lost all context when delivered via voice. To me, this re-emphasized the need to design voice first applications with Alexa.
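
If I were sending such notifications myself, I would rewrite them for voice before handing them to Alexa; the helper below is purely my own hypothetical sketch of that idea - it drops URL-heavy sentences and keeps a short, speakable summary.

    import re

    def to_voice_friendly(message: str, max_words: int = 25) -> str:
        """Rewrite a chat-style notification so it can be spoken naturally."""
        # URLs lose all meaning when read out character by character,
        # so drop any sentence that contains a link.
        sentences = re.split(r"(?<=[.!?])\s+", message)
        kept = [s for s in sentences if not re.search(r"https?://", s)]
        # Keep only a short, speakable summary.
        words = " ".join(kept).split()
        return " ".join(words[:max_words])

    chat_message = (
        "Deployment of api-gateway build 1042 finished successfully. "
        "Full logs: https://ci.example.com/builds/1042"
    )
    print(to_voice_friendly(chat_message))
    # -> "Deployment of api-gateway build 1042 finished successfully."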

Overall, I found the meetup very helpful in understanding the Alexa ecosystem, and learnt a lot of cool new things.
