You have been in speech recognition technology for two decades. What have you learned?
What’s funny is that everyone I know that works in speech gets kind of myopic about their corner of speech and I was guilty of that as well. I was doing desktop dictation and I thought it was the center of the industry. I then moved to Yap and transcribing voicemail messaging and then I thought that was it. Then I moved to Amazon and I thought that was central. Everyone sort of imagines whatever they are working on is where it’s all at.
The industry is a lot bigger than everyone thinks it is. I don’t think I learned that well until I started Cobalt three years ago. Suddenly, I had people coming to me and I had no idea that people were interested in this or that or developing products to do new things. Some of it is cutting edge. We have a partner that is using speech technology to diagnose diseases. It is a complicated enough process that when there is something wrong with your brain or respiratory function, it can be detected in speech. That’s not really in the news, but I think it will be really big business in 2-3 years.
I also have 4-5 customers that have approached me separately to help people learn a new language and help coach them on their word choice and pronunciation. I wouldn’t even have thought that was a deal at all, but it is now the single largest category. I could go on, but the point is that the field is bigger than I thought it was. There are people doing very interesting things and some of it won’t be public for a number of years, but it is a broad and diverse industry.
Tell me about Yap and the Amazon acquisition.
At Yap, we were really focused on accurate transcriptions of voicemail messages. I was able to attract a dozen of the best speech scientists in the industry to come and work with me. We were able to get our speech recognition to an equivalent level of accuracy as a competitor that was using human transcribers. That was the big technical accomplishment. It wasn’t a big single invention or breakthrough. It was good work by the team to train the engine and models to be as accurate as they were.
When Amazon decided they wanted to have a speech team, they decided that the Yap team had the skills they were looking for, acquired us, and redirected us to building the speech technology for Alexa. Amazon wouldn’t tell us what they had in mind and why they wanted to acquire us. We had no idea what they wanted us to do. They didn’t even tell us until a week after we had been hired. We were all taken to a dimly lit conference room in Seattle where they told us about Alexa. Then we had to essentially start over.
It was a big change to move to Amazon. We were no longer just transcribing people talking to people. We not only had to recognize the words people said, but also understand what they meant. At Yap, people were speaking colloquially to another person. Technically, it was a different type of speech technology. People speak differently to people than they do to machines. That really helped us out. I don’t think the Amazon team understood that it was a different speech technology at the time, but our team was able to make a lot of progress. The main accomplishment at Amazon on the speech technology side was getting speech to be recognized at a distance.
Tell me about your time at Amazon. What did you do there?
We were a team of 12 speech scientists and engineers. Maybe half of us had worked at Nuance and the others at university labs. Many of us had worked together previously. We were also current in the field. Our move from Yap to Amazon was at about the same time that deep neural networks started their rise to prominence for acoustic modeling for speech recognition. Deep neural networks were exactly the breakthrough needed to get the speech recognition at a distance to work. The timing was key. We were already well aware of the techniques and we were able to make use of that in building the speech technology for Amazon.
We worked closely with the folks in Lab126 that worked with the device that became the Amazon Echo and the chipset that became resident on the device. We were in the digital products group, but we were a software group. Our group was primarily split between Cambridge, MA and Seattle, WA. From the start, I oversaw both the ASR and NLU. Eventually the NLU was spun off and then I left.
What was different about your work at Amazon than what you had done previously?
A lot of people don’t know this, but in Dragon NaturallySpeaking there is a whole natural language command interface. You can say things like “indent the second paragraph on the third page.” I wasn’t a stranger to the idea of natural language command and control. It is true that the Alexa NLU was much more ambitious and a whole different animal.
Do you ever regret leaving Amazon before Echo and Alexa really took off?
It was perfect timing. I am glad I left at the time I did. I did leave just before it was announced, but it was completed before I left. So, I was able to see it finished. I had Echo in my home and office for a year before it was launched. When I founded Cobalt it was nice to have the publicity associated with Amazon to boost my publicity for Cobalt.
Do you own a Google Home or Amazon Alexa? What skills or voice apps do you use most often?
I have one in almost every room. Mostly I use them for music. I like to listen to music while I am working. My wife uses the Echo as her clock. She doesn’t wear a watch anymore. We check on the weather. I say good morning to her every day because she gives me a random fact about the day. I like to play Jeopardy every day. I am routinely listening to music and I like that I can ask about the music: “Who is singing this song?”
What music service do you use?
I have a collection of 20 thousand songs that I have uploaded to Amazon. I am mostly listening to music and playlists I have curated myself.
What does Cobalt do and why did you found the company?
Cobalt does what seems to me like the obvious thing for a company to do, but to my knowledge there is no one else doing it. We are a company of speech experts and we say whatever you are looking to do, we will help. If a company is doing something that involves speech and language technology and doesn’t have the skills to do it in-house, we will build it for them. We like to think of ourselves as a technology company and not a product company. We build the technology and our customers build the products that use it.
Amazon is a huge company and there are so many corners of Amazon doing so many interesting things that you can imagine that when people found out we had a speech team they were coming to us all of the time. On a regular basis someone would come to me from Amazon asking for help. Of course, we could do it, but we didn’t have time because of our Alexa project. So, I would say, “I am certain you can find some companies to help you.” They would come back and say they couldn’t find any.
There are thousands of companies that can help you build software. It turns out that speech and language are different. Software developers typically don’t know how to build speech and language solutions. We are that custom software development company that builds bespoke solutions for speech and language projects. We have been keeping ourselves quite busy building a whole lot of things for customers.
Why did you name the company Cobalt?
Cobalt is the 27th element in the periodic table. It is the first metal discovered since prehistoric times. We don’t know when gold and copper were discovered. Cobalt was discovered in the early 1800s. When it was discovered, the miners thought it was just like any other metal. When they smelted it like other metals, it became white powder. Another scientist discovered you could work with it but it required a different process. Software developers that work with speech face the same problems early metallurgists did with cobalt. Speech and language software development requires a different process.
What is something that most people don’t yet understand about voice and the likely trajectory of voice assistants?
I don’t want to sell people short, but one of the things that I often find myself having to explain to people is that speech technology is not yet a solved problem. Real speech technology, like being able to have my phone turned on when I am in a crowd and having it recognize what I am saying, is 20 years away. When Siri came out, people thought that speech was a solved technology problem. Siri was a nice accomplishment, but it gave the folks at Amazon the false impression that it was a solved problem. There was still a lot of work to do. Amazon sank hundreds of millions of dollars into building the models and solving the problems. It’s not something a startup could have done.
Someone will come to me and say, “I’d like to put my phone down on a table and at the end of an hour it will have a transcript of an entire meeting, will know the decisions and action items and send those out to the appropriate people.” We are a long way from this.
We still have a lot of problems when people are speaking to people. People accommodate when they are speaking to a computer. When people speak to people they use word and sentence fragments. They interrupt each other. It’s a messy problem. And, we aren’t very good at that yet. No one has invested in solving those problems. Inferring meaning from these broken, ill-formed questions is also not a solved problem. We can solve them but no one has really thrown the money behind it for a large group of speech experts to collect and annotate the data and address that challenge. We’ve designed the systems so the problem is tractable so we can understand the speech and know what to do with it. We are still a long way from having computers understand speech at the same level that humans do.
What are you going to present at the upcoming Insight Exchange Network Conference AI and Machine Learning 101?
Speech 101. I am going to take people through the basics of the components of a speech recognizer. I will also talk about what the technology is, what’s approachable and what’s still out of reach, and what we need to do to solve those problems and get good performance.