
What are the Limitations of Voice User Interfaces (VUI)? Product Hacker – Episode 3


On this episode of Product Hacker, we’re joined once again by Arcweb Head of Engineering, Shahrukh Tarapore, and Principal Consultant Mark Hughey. The emerging tech of discussion today will be voice interfaces and their application in the market. Are voice assistants advanced enough for diverse consumer profiles and could Alexa-like services ever have a place at work?

 

Kurt Schiller | Head of Marketing
Shahrukh Tarapore | Head of Engineering
Mark Hughey | Principal Consultant
Kurt Schiller [00:00:02]: Welcome to Product Hacker, the Arcweb business innovation podcast. We bring you the latest from the world of business innovation, from emerging technologies to game-changing ideas. Product Hacker connects you with the people and concepts that are changing the face of business. I’m your host, Arcweb Head of Marketing, Kurt Schiller. 

Kurt Schiller [00:00:24]: We want to thank everyone who visited Arcweb at the South by Southwest Amplify Philly House. We had a blast chatting with entrepreneurs, innovators, and business leaders from Philadelphia and beyond. If you didn’t make it out, make sure you take a look at Amplifyphilly.com to see what you missed. 

Kurt Schiller [00:00:43]: There’s been a major uptick in voice interface adoption spearheaded by consumer-use cases like Siri and Amazon’s Echo. Enterprise growth has been slower, but businesses are beginning to explore using voice technology to improve efficiency and provide new ways of reaching customers. But there are still tech challenges in the way. Not to mention, interface challenges. 

Kurt Schiller [00:01:04]: Here’s Arcweb Head of Engineering, Shahrukh Tarapore, and Principal Consultant, Mark Hughey. 

Kurt Schiller [00:01:11]: So, Shahrukh, I want to start off with a question for you. About five years ago, everyone was talking about natural language processing (NLP) as falling under the umbrella of machine learning and artificial intelligence. People seem to have stopped talking about it, but now we’re talking about voice interfaces all the same. What happened there? Where did NLP go?

Shahrukh Tarapore [00:01:30]: So I think NLP really just became a bigger thing than just one of the many algorithms underneath the umbrella of machine learning, and so it kind of got taken out and put into a box of its own. 

Shahrukh Tarapore [00:01:44]: The other aspect of NLP is that there are still huge challenges with accents and with uncommon words. Take my name, for example. My name is Shahrukh Tarapore, and when I tell my name to Android, sometimes I get “Shark Tar Paper.” It’s definitely funny, and I have a good joke with my friends about it. But if someone needs to call me in an emergency and their Android device can’t get my name correctly and thinks they’re trying to call a fish, that’s a big problem. 

Shahrukh Tarapore [00:02:14]: There are definitely limitations there that are continuously being worked on and getting better. But I think that for wider adoption to occur, these things are going to have to get worked out, because we’re living in more cosmopolitan societies: more complex types of words and phrases, diverse accents, and cultures are entering the mainstream, and they need to be accounted for as these technologies become part of common use in business. 

Kurt Schiller [00:02:42]: So we say that, and yet I just looked it up, and apparently there have been 20 million Amazon Echos sold so far. So clearly it’s not that much of an impediment, at least in certain use cases. What’s going on there? 

Shahrukh Tarapore [00:02:55]: One of the things about natural languages, as opposed to computer languages, is that they carry a lot of context, built into body language or into using pronouns or other sentence structures that assume the listener knows something about what was said previously. That’s very difficult for machines to pick up. So where I think we’re seeing a lot of excitement around natural language processing and around voice interfaces is in very well-defined use cases for very specific needs. Say I’m going to buy a certain thing: there’s a very specific commerce application there for consumer goods, and training natural language engines to understand the types of words people will use, and the variety of ways they could say those words, is a well-defined amount of scope. 
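[Editor’s note: the “well-defined scope” idea can be sketched in a few lines. This is a hypothetical illustration, not Amazon’s or Google’s actual API: a narrow voice use case reduces to a handful of intents, each with a limited set of sample phrasings and “slots” for the variable parts, and anything outside that scope is simply not understood.]

```python
import re

# Hypothetical intent definitions: the patterns enumerate the limited
# variety of ways a user might phrase the same narrow request.
INTENTS = {
    "BuyItem": [
        re.compile(r"(?:buy|order|purchase)(?: some| more)? (?P<item>.+)"),
        re.compile(r"add (?P<item>.+) to my cart"),
    ],
}

def match_intent(utterance: str):
    """Return (intent_name, slots) for a recognized utterance, else None."""
    text = utterance.lower().strip()
    for name, patterns in INTENTS.items():
        for pattern in patterns:
            m = pattern.fullmatch(text)
            if m:
                return name, m.groupdict()
    # Outside the defined scope: a real app should ask the user to rephrase.
    return None

print(match_intent("Buy more cereal"))    # → ('BuyItem', {'item': 'cereal'})
print(match_intent("How was your day?"))  # → None
```

Real platforms replace the regular expressions with trained language models, but the shape is the same: a closed set of intents and slots is tractable, while open-ended conversation is not.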

Shahrukh Tarapore [00:03:47]: When you try to extrapolate that to general conversation and to all the different types of topics that people can talk about that’s when I think we start to hit the limitations of the state-of-the-art that we have today. 

Mark Hughey [00:03:59]: Do you think that this is a problem, though, that necessarily needs to be solved by people who might potentially be our clients? The reason I ask is that it goes back to our discussion about blockchain technology and the difference between the platform and the application. Kurt pointed out 20 million Echos have been sold, and so there’s a platform there.

Mark Hughey [00:04:25]: And I sort of feel like– I feel like that’s not necessarily a problem that people interested in building voice interfaces or voice applications necessarily need to solve because Amazon is already working on solving it. Google is already working on solving it. And the question is, how can you provide value to your customers, your user base, whatever, by building something on top of that platform? 

Shahrukh Tarapore [00:04:51]: Yeah. I don’t really expect every company to go out there and try to solve the natural language processing problems. But there’s always going to be room for improvement, and the Amazons of the world are going to keep improving that technology, and other businesses will be able to leverage it to provide interfaces. I think it’s about knowing where the limitation is and then making sure that your experience can still provide value to a user in light of those limitations, and not being swayed by the idea that this is a mature technology that has all the edge cases sorted out, such that the variety with which we can create interfaces and experiences is equivalent to what we can do on a screen, for example. We’re not there yet. So I think we need to pay a lot more attention, in the case of voice interfaces, to those limitations, and make sure the experiences we’re creating through them are tolerant of those limitations. 

Kurt Schiller [00:05:48]: So, to make explicit something that I feel like both you, Mark, and you, Shahrukh, are dancing around: what is the requirement for a good voice use case right now? It sounds like, Shahrukh, you mentioned something that doesn’t require extra context. Maybe it’s the imperative of a command: “thing, do this.” 

Shahrukh Tarapore [00:06:11]: Yeah, I think it’s about the fidelity of the information. You know, if you’re a stockbroker, right, and you’re used to having five screens in front of you, showing you charts of various different trends, you’re not gonna be able to take that amount of information and convey it in an auditory way, or at least not in a way that can easily be consumed. You know, attention needs to be paid to the right interface for the right scope of information. And when you need to have that information, the good interfaces that come out will be ones that strike the right balance of information delivery and conversational interaction between the natural language processing and the end user. 

Mark Hughey [00:06:50]: There’s somewhat of an intersection here, though, and I would challenge you a little bit on that. The stockbroker with the five screens of data, right: yeah, you can’t reproduce all of that. But presumably that stockbroker is looking at five screens of data because each of those screens has a piece of information they need, and when they aggregate that information, they can answer a question and take some sort of action. And so this is where we intersect our previous thread about machine learning: if you have the ability to make the connections across those different data points proactively, and reduce the complexity of consuming the information down to the answer to the question you’re looking for, then you’re in a realm where conversational interfaces become more useful.

Kurt Schiller [00:07:45]: So are we going into a world, then, where we could start seeing something like an Alexa for work, for business? Because I keep thinking about how my 2-year-old daughter has almost figured out how to wake up the Alexa and tell it to stop playing music, and I’m just imagining working in an office setting where there are a bunch of voice interfaces going and people yelling over each other. To me it just sounds like a cacophony. What are the challenges of bringing a voice interface into a work environment, and how do you do it correctly, or how do you find the right use case to do it correctly? 

Mark Hughey [00:08:21]: I’m stuck on the thought in my head. When we say “Alexa for work,” I’m thinking about old episodes of Star Trek: The Next Generation. Think about any instance where you just want to say, “Computer, do this for me. Turn off the lights, turn down the temperature,” whatever. The challenge so far is that a lot of the use cases are focused around where voice is really obviously beneficial, like when you’re driving in a car, or any activity where having a hands-free interface is valuable. That’s where something is potentially really beneficial to have a voice interface for. 

Mark Hughey [00:09:06]: But in terms of some of the challenges at work. So one of the big challenges is security. We were talking about sort of the security and ownership of data. Well, if data is being projected in an audible way, then you introduce all kinds of liability and risk when the environment is not secure. 

Kurt Schiller [00:09:26]: I’ve heard a joke about a robber standing outside an office door yelling, “Alexa, unlock the doors.” 

Mark Hughey [00:09:37]: Yeah, I mean, it’s a good joke. And that’s actually another angle on it. The first example I gave was an authorized user interacting with the device while someone who’s not authorized hears what’s going on. The example you’re giving is someone who’s not authorized to use the device gaining access to it, because how does the device know whether the person it’s talking to is authorized or not? 

Mark Hughey [00:10:03]: So we’re looking into PIN activation and voice recognition. But again, I think that voice recognition is going to be one of those things that the platform provider is going to solve. We’re already seeing that with Amazon. They’re working with Alexa so that in-home it knows whether it’s your kid talking to the device or whether it’s you, and it can serve up context-specific content based on that recognition. 

Shahrukh Tarapore [00:10:29]: Yeah, and not to be a conspiracy theorist or anything, but I have a friend who’s always skeptical of their Amazon Echo and thinks that when they’re having a conversation with a member of the family, the Echo is recording it or using it to figure out products and services that Amazon can sell to you. So in a business context, it’s not just the concern of whether a third party is listening to the commands I’m giving to the Echo or to this NLP device. It’s also: am I having a conversation with a human being in the same vicinity as one of these devices, and what is it picking up, and what is it doing with that information? So there are side effects to using these interfaces that are going to change the way we interact with technology, and that we haven’t really considered yet, I think.  

Kurt Schiller [00:11:20]: Perhaps this is a question for one of our designers instead of you two, but I’m going to ask it anyway. It seems as if there’s something uniquely frustrating about a voice interface not doing what you expect it to do compared to a mouse and keyboard interface or some kind of a screen interface. 

Kurt Schiller [00:11:38]: And I wonder if that’s just the fact that we expect very high fidelity from other humans at understanding what we’re saying. So, to turn this into a question: how do we make a voice interface helpful and discoverable in the same way that the existing interfaces we have can be? And is that something that’s necessary for the use cases we’re trying to take on now? 

Shahrukh Tarapore [00:12:02]: I think it’s helpful to remember that the underlying technology of a voice interface is unimaginably more complex than the underlying technology of a mouse and a keyboard. The predecessors of the keyboards and mice we use today were mechanical devices. At some point that mechanical input was translated into an electronic input that the computer could then understand, to move your mouse to some place or click on a button and do something, right? 

Shahrukh Tarapore [00:12:27]: So the intent a user expressed to a screen or a physical device was much easier to deliver on, for the user and for the technology. Whereas with voice, because natural language processing stems from machine learning, there’s a kind of universal understanding in machine learning that the training data you give to any system is going to dictate how well it performs at predicting the future things you didn’t train on. It’s the idea of garbage in, garbage out. So if the data you’re using to train a natural language technology to understand someone’s natural language in a real use case is limited, then your voice application’s ability to understand things that weren’t part of that training data is limited too. 

Shahrukh Tarapore [00:13:26]: And so we’re always going to be looking at how we can continuously train these voice interfaces and these natural language technologies to have a wider breadth of context: the different types of things we could say, the different types of slang we might use as natural language evolves continuously, accents, the variety of things that make natural language so compelling and so diverse. And that’s very difficult. That’s the underlying fundamental technology driving these interfaces, so we have to realize there’s a great amount of complexity there. 

Mark Hughey [00:13:59]: I think that’s a really good point. It reminds me of something we’ve been wrestling with here at Arcweb when we talk about interfaces, and you mentioned maybe this is a better question for the designers. When we talk to our clients about the teams we put together, having a designer on the team in the context of designing a visual user interface is an easy concept to get. Everybody understands that. 

Mark Hughey [00:14:28]: But having a designer that is thinking about the user experience, how a person moves through the use of a product, how that experience flows from one step to the next, and making that journey through the use of the product as frictionless as possible. That’s a bit of an art. You know, we see a lot of challenge in being able to convey the value of that to our clients, because, again, it goes back to what kind of experience are your users going to have? 

Mark Hughey [00:15:01]: Are they going to be delivered value in the way they expected? Was it a pleasant experience that’s going to keep them coming back and using it over and over again? 

Mark Hughey [00:15:12]: And so, in the world of visual design, we have these very established patterns that can be drawn on, and lessons learned, because obviously that medium has been around for a lot longer than voice. So when we’re designing voice products now, we’re internally writing from scratch the user experience best-practice guide that’s going to help us evaluate whether or not we’re designing something correctly and whether the experience is going to be pleasant and useful. 

Mark Hughey [00:15:48]: And there are some very unique challenges because of all the things we’ve talked about already with voice and context and accents and just the limitations of what can be understood by the technology. Right now, I don’t necessarily know whether we have a clear answer to it yet, but it’s kind of– that’s kind of because we’re on the edge of innovation here. 

Mark Hughey [00:16:10]: We are continuing to evolve those best practices and see what’s capable and of course, the landscape is constantly changing with the improvements in the technology, too. So all of those things need to be monitored and need to be enveloped into ongoing projects. But I think it certainly presents a lot of exciting opportunities for people out there. 

Kurt Schiller [00:16:33]: And I think possibly one of the benefits similar to what you were saying about the growth of IoT is that clearly there has been an established consumer demand for these voice interface devices. So, perhaps we’ll get those examples that we need to establish the guidelines of what building a good voice interface is and perhaps quite quickly given the adoption rate that we’ve seen. 

Mark Hughey [00:16:57]: Yeah, there’s no doubt, especially with the adoption rate you mentioned: 20 million devices sold, and that number is going to double to 40 million this year. Amazon has a huge early-mover advantage from inventing the Echo; two-thirds of all of those people are going to be interacting with an Echo device this year. So the time is right to develop something that can be consumed via voice. 

Kurt Schiller [00:17:24]: So once again, it sounds like we already have our answer to the question before I ask it. But I’ll throw it right back to you, Mark, since you basically just– you basically just showed your cards. 

Mark Hughey [00:17:33]: Yeah, I did. I did. Sorry for the spoiler. 

Kurt Schiller [00:17:36]: So, so go forward or wait? It sounds like you’re a strong wait? 

Mark Hughey [00:17:41]: No, no. So I think go now. Absolutely, especially for those reasons: adoption is on the rise, and there’s already a massive audience that is accessible to you through these platforms. In addition to that, both Amazon and Google, where we’ve been focusing our attention so far, offer really robust development kits that allow people like us to interface with them and very quickly and rapidly test prototypes and get products to market. Those advantages are huge. All of that infrastructure and that ability to quickly innovate and get to market is there already. So yeah, go now. 
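[Editor’s note: to make the development kits Mark mentions concrete, here is a minimal sketch of a skill backend. The request and response shapes are modeled loosely on the Alexa Skills Kit JSON format, but the field names and the `WeatherIntent` intent are simplified illustrations, not a spec reference: the platform does the speech recognition and intent matching, and your code just maps an intent name to a spoken reply.]

```python
def handle_request(event: dict) -> dict:
    """Map an incoming voice request to a plain-text spoken response."""
    request = event.get("request", {})
    if request.get("type") == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        text = "Welcome. What would you like to do?"
    elif request.get("type") == "IntentRequest":
        intent = request.get("intent", {}).get("name")
        if intent == "WeatherIntent":  # hypothetical intent name
            text = "Today will be sunny and 72 degrees."
        else:
            text = "Sorry, I can't do that yet."
    else:
        text = "Goodbye."
    # The platform turns this structure back into speech on the device.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

reply = handle_request(
    {"request": {"type": "IntentRequest", "intent": {"name": "WeatherIntent"}}}
)
print(reply["response"]["outputSpeech"]["text"])
```

The point is how little of the hard problem lands on the application developer: the handler never touches audio at all, which is why prototypes can get to market so quickly.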

Kurt Schiller [00:18:33]: Shahrukh?

Shahrukh Tarapore [00:18:34]: I’m going to shock the world and say wait. But I agree with everything Mark just said, and I think that if you have a use case that is very transactional, then now is the time to go. For transactional purposes, where you have a command like “what is the weather today?” or “buy more cereal,” whatever it might be, get on it right now; it’s definitely go. 

Shahrukh Tarapore [00:18:56]: But where I think the wait comes in is if your user or your product expects a conversational style, where the way someone is going to get value out of the product builds on the ongoing interaction with the voice interface. I think that’s a wait, because we’re just not there yet in terms of being able to conceptualize how context is accrued and then used to understand the next thing the human is saying, and then keep the conversation going between the interface and the user. So it’s a wait-and-see on that one.

Kurt Schiller [00:19:32]: So that’s it for our new technology overview. Mark and Shahrukh, thanks so much for joining us. 

Shahrukh Tarapore [00:19:37]: Thank you for having us. 

Mark Hughey [00:19:38]: Yeah, thank you. Pleasure. 

Kurt Schiller [00:19:39]: Thanks for tuning in to Product Hacker. Join us next time as we take a look at some recent news developments with impact on the product world, including the Facebook privacy scandal and consolidation in the healthcare industry. If you’re enjoying Product Hacker, make sure you like, subscribe, and sign up for our newsletter to get notified of new episodes and upcoming events. 

Kurt Schiller [00:19:57]: Product Hacker is brought to you by Arcweb Technologies, a digital product design and development agency in Old City, Philadelphia. Learn more by visiting us at Arcwebtech.com or calling 1-(800)846-7980. 


About The Author(s)

Siara Singleton
Siara Singleton is a Marketing Associate at Arcweb Technologies who writes thought leadership blogs about digital transformation, healthcare technology, and diversity & inclusion in the tech industry.
