SpeechTEK Europe 2011 - Dave Burke Interview

SpeechTEK Europe 2011 - The Voice Solutions Showcase

25 - 26 May 2011 • Copthorne Tara Hotel • London, UK

SpeechTEK University • 24 May 2011

Dave Burke, Engineering Director at Google

Dave Burke, Engineering Director at Google, gives the opening Keynote Address at SpeechTEK Europe in London this May. In an exclusive interview with SpeechTEK Europe, Dave describes some of the challenges Google faces in developing speech technologies for Android and Chrome; why smartphone users are set to embrace speech apps; and the other speech technologies Google is working on now.

Look out for the second part of our interview with Dave Burke later this month...

What’s your vision for the mobile device?

The smartphone is the defining, iconic product of our time. The shift to mobile computing is happening fast - mobile web adoption in the US is ramping up 8 times faster than desktop web adoption did in the mid-1990s. In two years, we will reach an inflection point where the number of smartphones sold will outstrip the number of PCs sold. And mobile computing isn't just about the devices - it's about pervasive connectivity. Your phone is connected to the Internet 24x7, no matter where you are. This will have a profound effect on the way people access information and services. Dialogue-based services will cede to rich interactive experiences in the form of web and packaged applications. Speech capabilities will become part of the applications themselves, working in harmony with other input modalities such as touchscreens.

Android phones already support speech technology. What new capabilities will be enabled on Android phones by the work your group is doing?

We've been working hard to enhance the speech capabilities on Android with each new release of the platform. We started out by building voice search - the ability to speak web search queries - in the "Cupcake" version of Android in 2009. In a subsequent release, we added the capability to use speech for any text input box by adding a microphone to the keyboard. Last year, we launched a new set of capabilities we call Voice Actions, starting with US English. The idea behind Voice Actions is to take common but complex multi-step actions and speech-enable them. Examples include commands for sending text messages and emails, calling a business or contact, listening to music, or setting the alarm.

One of the nice things about the Android platform is that pretty much every core function has associated developer APIs. Speech is no different, and we've been continuing to enhance the APIs with each new release. Android supports simple primitives called intents that allow you to quickly speech-enable your application, as well as more advanced APIs for lower-level control. We plan to continue to enhance speech in Android in future releases. Expect to see more capabilities, more languages, faster and more accurate recognition and synthesis, and API improvements.

What speech functions will be embedded into the Android platform and what speech functions will be available on the cloud?

I think this is an interesting question. Embedded technologies promise faster response times and are immune to patchy network conditions. On the other hand, cloud-based services offer significantly larger and more sophisticated language and acoustic models for speech recognition and higher quality text-to-speech. Coming up with the right hybrid strategy is still an unsolved problem in my view and something which, if done right, may greatly enhance the user experience.
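
Editor's illustration: the trade-off described above can be made concrete with a deliberately naive hybrid policy. The sketch below is not Google's approach and does not use any Android API; `recognize_hybrid`, the `cloud` and `embedded` callables, and the `cloud_timeout_s` parameter are all hypothetical names introduced here. It assumes the cloud recogniser is preferred for accuracy, and that the embedded recogniser takes over when the network is down or the cloud answer does not arrive within a deadline.

```python
import concurrent.futures
from typing import Callable, Tuple

# Hypothetical recogniser type: takes raw audio bytes, returns a transcript.
Recognizer = Callable[[bytes], str]


def recognize_hybrid(
    audio: bytes,
    cloud: Recognizer,
    embedded: Recognizer,
    cloud_timeout_s: float = 1.0,
) -> Tuple[str, str]:
    """Return (transcript, source), where source is "cloud" or "embedded".

    Assumed policy: try the cloud recogniser first; if it raises a
    network-style error or misses the deadline, use the embedded one.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud, audio)
    try:
        # Wait for the cloud result, but only up to the deadline.
        return future.result(timeout=cloud_timeout_s), "cloud"
    except (concurrent.futures.TimeoutError, OSError):
        # Patchy or slow network: fall back to the on-device recogniser.
        return embedded(audio), "embedded"
    finally:
        # Do not block on a cloud call that is still in flight.
        pool.shutdown(wait=False)
```

A real strategy would need to weigh far more than a single timeout (for example, partial results, battery use, or the size of the on-device models), which is part of why the question remains open.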

Why should application developers create speech applications on the Android platform rather than on other smartphone platforms?

Android is unique in that it provides speech APIs built into the platform for high quality speech recognition and synthesis. Any OEM can build Android-based phones with inbuilt speech capability and any developer can add speech features to their application at no cost.

Why will smartphone users embrace speech applications when users are already adept at entering information by touching the Android screen?

Speech offers an adjunctive modality for the modern smartphone. Speech technologies afford the user the ability to attend to the device far less than traditional user interface techniques. For example, it is now possible to quickly and effortlessly send a text message by voice while walking down the street. There are other situations, such as in-car, where a hands-free mode is obviously beneficial. And despite great improvements in touch screens and keyboards, it is still often quicker to speak rather than type - for example, in long web queries or performing multi-step actions like setting an alarm. I don't believe speech will ever replace other input modalities entirely - there are plenty of situations where it is inappropriate or less effective, like at a rock concert or in a library! But there are also plenty of situations where having the ability to quickly issue queries and instructions by voice greatly enhances the user interface of the device.

What other speech technologies is Google working on?

One platform which is still conspicuously missing widely-deployed speech capabilities is the Web. Last year, Google came together with other companies including Microsoft, Nuance, and Voxeo to start a new Speech XG Incubator Group within the World Wide Web Consortium (W3C). The Speech XG group is focused on extending HTML 5 with the ability to leverage speech recognition and synthesis from within Web browsers. The goal is to make it very straightforward for Web developers to speech-enable their applications.

Why will Android developers embrace Speech XG proposals when the SALT (Speech Application Language Tags) technology from Microsoft was largely ignored?

SALT was envisaged at a time when dialogue languages were the focus of the industry. As a result, SALT tries to be "all things to all men" - both a dialogue language and an extension to HTML. With the Speech XG work, we're focussing just on the latter, embodying customs and conventions familiar to modern Web developers.

The timing of the Speech XG work is also notable. SALT was conceived at a time when there was little innovation happening in the Web platform. Today, the environment is quite different. Fueled by efforts in the WHATWG group and W3C, and under the broad moniker of HTML 5, new APIs are being actively worked on that greatly improve the power of the Web platform. Examples include offline storage, worker threads, audio and video, geolocation, device orientation, notifications, and camera access. Thanks to innovative browsers with fast development cycles such as Firefox and Chrome, these new capabilities are getting into the hands of Web developers - and ultimately users - faster than ever.

Dave Burke's SpeechTEK Europe Keynote Address - Cloud-based Speech Recognition for Mobile and the Web - takes place on Wednesday 25 May 2011.

SpeechTEK Europe features over 50 speakers from around the world, and from a wide range of business environments including Google, Barclays Bank, Deutsche Telekom, Nuance, Loquendo, Openstream, Voxeo, Belgian Railways, Telecom Italia and Cable & Wireless.
