Speech recognition is one of those technologies that has been around for quite a while but has not yet found its way into large-scale use in industry. Yes, we've probably all talked to the computer at an airline reservation center at least once in our lives, but I wouldn't call that real speech recognition. You have to choose between a couple of words, and if you say something different, it redirects you to one of the choices anyway. This is what we call a small-vocabulary speech recognition application. It is useful (I guess), but it is not what I think is the best use case for speech technology.
The main problem researchers face is that every person speaks very differently. Especially when several people take part in the same conversation, the speech recognition system has to cope with all those different styles and vocabularies. I don't expect it to take many more years before the technology can handle those complications, but there is already some good news!
For some (actually quite a lot of) possible applications of speech recognition, you don't need 100% accuracy. If a completely accurate transcription of the speech isn't strictly necessary, the current technology can already be very useful. For example, wouldn't it be nice to:
– Automatically transcribe your phone calls, meetings, and personal memos into text for later reference
– Search through the content of a podcast archive for interesting episodes
– Search through the content of videos for interesting shows
To achieve the latter, Exalead has partnered with LIMSI and developed Voxalead. With Voxalead, you can search through the content of videos and play a video directly in the embedded player. The player displays the transcript next to the video, and clicking on a word jumps the video to the exact moment that word was spoken.
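To make the "click a word, jump to that moment" idea a bit more concrete, here is a minimal sketch in Python. It assumes the recognizer outputs a start time for every word; the `(word, seconds)` format, the `build_index` and `search` names, and the sample data are all hypothetical and not a description of how Voxalead actually works. The sketch builds a simple inverted index from words to playback positions.

```python
from collections import defaultdict

# Hypothetical word-level transcript: (recognized word, start time in seconds).
# This format is an assumption for illustration, not Voxalead's data model.
Transcript = list[tuple[str, float]]


def build_index(videos: dict[str, Transcript]) -> dict[str, list[tuple[str, float]]]:
    """Map each lowercased word to every (video_id, start_time) where it was recognized."""
    index: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for video_id, words in videos.items():
        for word, start in words:
            index[word.lower()].append((video_id, start))
    return dict(index)


def search(index: dict[str, list[tuple[str, float]]], query: str) -> list[tuple[str, float]]:
    """Return (video_id, seek_time) hits for a single-word query."""
    return index.get(query.lower(), [])


if __name__ == "__main__":
    videos = {
        "news-01": [("the", 0.0), ("election", 0.4), ("results", 1.1)],
        "talk-07": [("speech", 3.2), ("recognition", 3.7), ("election", 12.5)],
    }
    index = build_index(videos)
    print(search(index, "Election"))  # [('news-01', 0.4), ('talk-07', 12.5)]
```

A misrecognized word simply ends up under the wrong key in the index. As long as other occurrences of the search term are recognized correctly, the video is still found, and the seek time from any hit can be handed to the player.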
There will definitely be plenty of errors in the transcription, but… it doesn't matter! The goal is to search through videos, and if some words are not recognized, that is no problem at all. I think it's an excellent example of how to put current speech-to-text technology to good use. What do you think?