What are you building?
Jeff Smith: John Done is a telephone voice assistant. By that we mean he can make and answer phone calls for you and even join you on calls to get things done. It is about using AI to help you on phone calls and take some of the pain out of talking on the phone.
Why this problem?
Smith: A couple of reasons. One of the things we noticed is that voice technology has been deployed in customer service at large companies, but the way it is being used is consumer-hostile. You have to use a phone tree or a messaging channel that is usually now a bot, and not a very good one. It's not typically AI, but really just an FAQ bot. As a consumer, that environment is frustrating because no one is on your side.
My co-founder and I were experts in building intelligent agents—Amy at x.ai. We realized we could build an intelligent agent that works on your behalf. The more you dig into this, the more it becomes clear that no one is really working on your problem. People speak to their family and friends, but the rest of the things they spend their phone time on are kind of irritating.
You want to know if there is a wait at a restaurant. You want to go to the clothing store down the street, but want to know if they have flip-flops in stock before you do. The only long-tail solution for these types of problems is the telephone. No one is making it easier for you as a consumer. You have to call and find out. You wait on hold. Big companies are trying to control their costs and put that time on you. We think the solution is to put technology in your hands to put time back in your day.
Smith: Now is a pretty exciting time for vocal computing. It is the lower-level innovation necessary for anything that operates through spoken communication. We now have near-human levels of speech transcription. It's closer in Chinese than it is in English, but still very good. In a year or so, we are going to have synthesized voices that are indistinguishable from human voices. It is an opportunity to take advantage of voice computing technology that is very useful and solve people's problems.
The antiquated IVR you get when you call an airline, for example, was developed on fundamentally different technology. The advances in vocal technology are changing what we can do. You see it a little on smart speakers and mobile. It truly gives us an opportunity to provide ambient technology that will go out and work on your behalf. That simply wasn't possible before.
Smith: x.ai is a pretty unique company and product. My co-founder, Wesley Harris, and I left Intent Media to work on x.ai together. We were excited about the company's technical mission: an agent that operated autonomously and executed nontrivial tasks. The AI out there at the time didn't really accomplish tasks. Most people were working against things that already had APIs. Telling a smart speaker to do those things is similar to a button press. x.ai negotiates a human transaction, which is a complex social process.
We saw the power of building intelligent agents that take on work for people. We saw how important that sort of problem framing [into a specific domain] can be. We saw how much better that got quickly and how you could build agents that addressed other problems. That pointed to how we could apply AI to a different problem.
x.ai famously has people behind the AI to support a good outcome when the AI is struggling and also to label the data for learning. Is that the John Done approach?
Smith: It is not. Data labeling is a smaller component of what we have to do. The x.ai founder likes to talk about supervised learning at conferences. It was going to be impossible [for x.ai to succeed] unless they invested in high quality data.
In this case, we likely will have to invest more in humans doing data labeling in the future. However, we have started by using technology provided by big tech companies, and we do this in an offline training mode. We are engineering our dialogues and then testing them in the wild.
x.ai is trying to build the most sophisticated classifiers in NLP, with understanding of extremely fine-grained concepts like times and people. At John Done we are starting with something far simpler: conversations that don't have that semantic richness. We might move into canceling your cable service or negotiating a lease in the future. But we have started with simpler questions. We are not trying to extract the same range of intents.
Are you focused on single queries and limited turn conversations?
Smith: The tasks we have intentionally prioritized at this stage in our company's lifespan are simple questions. We have a long list of tasks on our roadmap. It is entirely possible to start with simpler dialogues and make customers happier than bespoke language interactions would.
We also have the ability to retry, to give the call a second shot. We can make multiple calls over time. These are things you don't want to do when scheduling a meeting. But for finding a place with a certain item in stock or booking a hair appointment next week, we have a longer leash. It puts less onus on the natural language processing (NLP).
Not too far down the path we will add some niceties that make it even more useful for people. These are things like time shifting. We can look up open hours on the internet and make sure we are calling when a store is open.
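The retry and time-shifting ideas above can be sketched in a few lines. This is a hypothetical illustration, not John Done's actual implementation: the `OPEN_HOURS` table, `is_open`, and `attempt_call` names are all invented for the example, and `dial` stands in for whatever places the real call.

```python
from datetime import datetime, time

# Hypothetical open-hours table, as if looked up on the internet.
OPEN_HOURS = {"store": (time(9, 0), time(18, 0))}

def is_open(place: str, now: datetime) -> bool:
    """Time-shifting check: only call while the business is open."""
    opens, closes = OPEN_HOURS[place]
    return opens <= now.time() < closes

def attempt_call(place: str, now: datetime, dial, max_retries: int = 2) -> bool:
    """Give the call up to max_retries + 1 shots, but only during open hours."""
    if not is_open(place, now):
        return False  # defer the call until the place is open
    for _ in range(max_retries + 1):
        if dial(place):  # dial() returns True on a successful conversation
            return True
    return False
```

The point of the sketch is that retries and scheduling sit outside the dialogue engine, which is why they put less onus on the NLP.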
How far along is the product?
Smith: Certain types of pre-defined commands work just fine. I have my laundry picked up using John Done, and that works well. A format for scheduling a service like the plumber or the cable guy, and things like getting a haircut, a massage, or a restaurant booking, are only a little more complicated than that. [Adding these will] make it a lot more useful pretty soon.
What do you think about services that start with humans and then attempt to transition to AI?
Smith: Facebook M and [other services] were founded on the same premise: that you can learn from human interactions [and then train an AI]. I do not believe you can take non-agent dialogues and build out something that works for AI. Human dialogues simply have too much heterogeneity to allow you to do this. x.ai and John Done understand that you must scope the domain sufficiently and start with AI. Otherwise, it will be computationally impossible to develop useful training data.
Are you building your own NLU?
Smith: We are building on top of the most powerful stuff that is out there in the industry. We use enabling technologies from the big tech giants. We are using API.ai. For voice recognition we are using IBM Watson. For speech synthesis we are using a Scottish company called CereProc. They have the best commercially available voices right now. The next closest thing we have used is Amazon Polly, which is decent for voice synthesis but not as good as CereProc.
What is your favorite AI-based application other than John Done?
Smith: My co-founder and I are using Alexa and Google Home, tied to an Android phone and a Chromecast, in a comparison test. A unique capability of the Google platform, which is clear from a developer perspective, is that they have well-structured, highly curated domains so you can look up facts. There are a lot of enterprise use cases that will come out of that.
How do you think about John Done in relation to the voice platforms such as Amazon Alexa and Google Assistant?
Smith: We want our stuff available everywhere it can deliver capabilities to users. So far, Amazon has done the best job of making those available to users. The interface is a relatively small component of the stack. Whether you are in the kitchen, on the street, or in a meeting, we want to give you more places to get to John Done. That's how busy people work. Modern lives are kind of messy. You can be anywhere and should use whatever conversational agent is most convenient.
Editor's note: The video below of John Done's Voicecamp demo is a good opportunity to see the technology in action. The product demo starts at about 2:30.