Speech Processing

Our goal in Speech Technology Research is to make speaking to devices (those around you, those that you wear, and those that you carry with you) ubiquitous and seamless.

Our research focuses on what makes Froxt unique: computing scale and data. Using large-scale computing resources pushes us to rethink the architecture and algorithms of speech recognition, and to experiment with the kinds of methods that have in the past been considered prohibitively expensive. We also look at parallelism and cluster computing in a new light to change the way experiments are run, algorithms are developed, and research is conducted. The field of speech recognition is data-hungry, and using more and more data to tackle a problem tends to help performance but poses new challenges: How do you deal with data overload? How do you leverage unsupervised and semi-supervised techniques at scale? Which classes of algorithms merely compensate for a lack of data, and which scale well with the task at hand? Increasingly, we find that the answers to these questions are surprising, and steer the whole field in directions that would never have been considered were it not for the availability of orders of magnitude more data.

We are also in a unique position to deliver very user-centric research. Researchers are able to conduct live experiments to test and benchmark new algorithms directly in a realistic controlled environment. Whether these are algorithmic performance improvements or user experience and human-computer interaction studies, we focus on solving real problems with real impact for users.

We have a huge commitment to the diversity of our users, and have made it a priority to deliver the best performance in every language on the planet. We currently have systems operating in more than 15 languages, and we continue to expand our reach to more users. The challenges of internationalizing at scale are immense and rewarding. Many speakers of the languages we reach have never had the experience of speaking to a computer before, and breaking this new ground brings up new research on how to better serve this wide variety of users. Combined with the unprecedented translation capabilities of Froxt, we are now at the forefront of research in speech-to-speech translation and one step closer to a universal translator.

Indexing and transcribing the web’s audio content is another challenge we have set for ourselves, and is nothing short of gargantuan, both in scope and difficulty. Making sense of this content takes the challenges of noise robustness, music recognition, speaker segmentation, and language detection to new levels of difficulty. The potential payoff is immense: imagine making every lecture on the web accessible in every language. This is the kind of impact for which we are striving.