I’ve been heads-down with Tempo AI and so I haven’t been able to write much. There is clearly a lot of fervor around Google Glass. Everyone I know wants to try it, and with KP, Google Ventures, and A16Z starting a Google Glass venture fund, and some other buddies starting a Google Glass incubator called Stained Glass Labs, you know you have reached a tipping point!
My selfish interest in Glass is how we can use it with Tempo to be even more anticipatory and contextual, but my curious interest is actually related to voice. There are significant misunderstandings about how voice-to-text works. The core technical component is known as the ASR (automatic speech recognizer). For a machine to accurately translate voice to text, it needs access to millions of samples of human speech to improve the statistical models it’s built upon. This is why each time you speak into your phone, TV, or computer, those utterances (as they are technically called) are sent to a server, stored, and often manually listened to and transcribed, all to keep improving the algorithm. On this note, Apple recently said Siri stores your voice samples for two years, probably for this purpose.
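The improvement loop described above can be sketched schematically. This is not a real ASR API; the function names and stubs are placeholders I made up to show the shape of the pipeline — serve the user now, queue the audio for human transcription, and grow the training corpus for the next retraining pass.

```python
# Schematic sketch of the ASR improvement loop: every utterance is
# logged server-side, transcribed (often by hand), and appended to
# the training set used to refit the statistical models.
# All names here are placeholders, not a real ASR API.

training_set = []

def handle_utterance(audio, recognize, transcribe):
    """Recognize audio for the user immediately; queue the audio
    with its human-labeled transcript to improve the model later."""
    text = recognize(audio)             # serve the user now
    gold = transcribe(audio)            # human-labeled ground truth
    training_set.append((audio, gold))  # grows the corpus for retraining
    return text

# Stubbed example: a "recognizer" that guesses imperfectly and a
# "transcriber" that returns the correct text.
result = handle_utterance(
    b"fake-audio-bytes",
    recognize=lambda a: "what is the park",
    transcribe=lambda a: "where is the park",
)
print(result)             # what is the park
print(len(training_set))  # 1
```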
The big challenge is the input. An utterance over the voice network (e.g., a 1-800 number), via a phone application over the data network, and via your car’s Bluetooth are all different. They involve different microphone qualities and different kinds of background noise. This is why Microsoft was rumored for many years to have the better ASR for voice calls (because of Tellme) while Google was thought better for digital — voice via an application on the data network — because of Android (and both, by the way, trail market leader Nuance). This potentially means you need millions of utterances for each unique combination of microphone, setting, network, and device. It’s a lot of work, and it has taken significant engineering investment to get even to where we are today.
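One way to picture the channel problem is as a registry of acoustic models, one per microphone/network combination, with a generic fallback for channels you haven’t collected utterances from yet. The channel keys and model labels below are invented for illustration; this is a sketch of the idea, not how any vendor actually routes requests.

```python
# Hypothetical sketch: an ASR service keeping a separate acoustic
# model per input channel (device + network), falling back to a
# generic model for unseen channels. Names are made up.

ACOUSTIC_MODELS = {
    ("phone", "voice_network"): "telephony_8khz_model",
    ("phone", "data_network"): "mobile_16khz_model",
    ("car_bluetooth", "data_network"): "car_noisy_model",
}

def pick_model(device, network):
    """Choose the model trained on utterances from this channel,
    falling back to a generic model when no channel data exists."""
    return ACOUSTIC_MODELS.get((device, network), "generic_model")

print(pick_model("phone", "voice_network"))  # telephony_8khz_model
print(pick_model("glass", "data_network"))   # generic_model: no Glass data yet
```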
To make voice-to-text better, applications often couple it with other fuzzy technologies. For example, Siri is rumored to post-correct the voice-to-text output using NLP (natural language processing). This means that if the output of your utterance is “What is the park,” Siri might post-correct it to “Where is the park,” recognizing that the first output was grammatically incorrect. This is in part why Siri was such a technological achievement: it was able to take garbage in and still return a meaningful result.
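A toy version of this kind of post-correction is reranking candidate transcripts with a language model: the candidate whose word sequence looks more like real language wins. The sketch below uses a tiny add-one-smoothed bigram model over a made-up corpus; real systems use far larger models, but the principle is the same.

```python
# Toy sketch of NLP post-correction: rerank ASR candidates with a
# tiny bigram language model. The corpus and candidates are
# invented for illustration.

import math
from collections import Counter

corpus = [
    "where is the park",
    "where is the station",
    "where is the bank",
    "what time is it",
]

# Count bigrams (with a start-of-sentence token) and left contexts.
bigrams = Counter()
contexts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1
        contexts[a] += 1

def score(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split()
    vocab = len(contexts) + 1
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab))
        for a, b in zip(words, words[1:])
    )

# The raw ASR output vs. a grammatical alternative: "what is" never
# occurs in the corpus, so the "where" candidate scores higher.
candidates = ["what is the park", "where is the park"]
best = max(candidates, key=score)
print(best)  # where is the park
```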
With Glass, Google is collecting a new set of utterances that has never existed before. The microphone is on your face, and this will yield millions of new utterances to build upon. But what gets me really excited is that Glass rests on your face, so Google could potentially improve the voice-to-text by coupling it with the accelerometer. As you move your mouth, Glass moves ever so slightly. I don’t know whether this movement is significant enough to measure or is just noise, and whether it would need to be measured separately while sitting, walking, running, and so on. But if there is enough variance to derive patterns, Google could effectively use the accelerometer in Glass to post-correct the voice-to-text. It’s the equivalent of a machine “lip-reading,” and could potentially be more accurate than the voice-to-text itself.
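If such patterns existed, the fusion could look like another reranking step: blend the ASR’s confidence in each hypothesis with a score for how well that hypothesis matches the jaw-motion pattern the accelerometer recorded. Everything below is invented to illustrate the idea — the scores, the hypotheses, and the blend weight would all have to be learned from real data.

```python
# Hypothetical sketch of fusing an accelerometer-based
# "lip-reading" score with ASR confidence to rerank hypotheses.
# All numbers here are made up for illustration.

def fuse(asr_scores, accel_scores, weight=0.3):
    """Pick the hypothesis with the best weighted blend of ASR
    confidence and accelerometer pattern-match score."""
    return max(
        asr_scores,
        key=lambda h: (1 - weight) * asr_scores[h]
        + weight * accel_scores.get(h, 0.0),
    )

# ASR alone slightly prefers the wrong transcript...
asr = {"what is the park": 0.52, "where is the park": 0.48}
# ...but the jaw-motion pattern matches "where" much better.
accel = {"what is the park": 0.20, "where is the park": 0.90}

print(fuse(asr, accel))  # where is the park
```

Setting `weight=0` reduces this to plain ASR, so the accelerometer signal only has to help at the margin — which is exactly the post-correction role described above.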