Google releases latest update to AI speech model
One of the many products Google is developing is a speech model, the Universal Speech Model (USM), which should ultimately be able to translate spoken language live. Using artificial intelligence (AI), the model is eventually meant to understand and translate 1,000 languages. According to the recently reported latest update, it has now been trained on more than 300 languages. The model can perform automatic speech recognition on widely spoken languages such as English and Mandarin, as well as on languages such as Punjabi, Balinese, Shona, Malagasy, Xhosa, and Lingala, to name a few.
There are two major challenges in creating a speech model:
- The first is getting enough data to train the AI properly. The current model has been trained on more than 12 million hours of speech and 28 billion sentences of text. Some languages are spoken by fewer than 20 million people, which makes it very difficult to find training data on YouTube.
- Second, the learning algorithm must be flexible enough to take in large amounts of data from different sources and apply what it learns to new languages and use cases. The model was initially designed to generate subtitles for YouTube videos and to perform automatic speech recognition in 100 languages, but it should also be usable for new applications and other languages; the sketch after this list illustrates the general idea.
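The article does not describe Google's actual training setup, but the adaptation pattern it hints at, reusing one large pretrained model and cheaply specializing it per language, is a standard transfer-learning recipe. Below is a minimal, hypothetical PyTorch sketch: a frozen stand-in encoder plus a small trainable per-language output head. Every class, dimension, and hyperparameter here is an illustrative assumption, not Google's USM.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a large pretrained speech encoder
# (real USM encoders are far larger; this is illustration only).
class PretrainedEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, mel bins) -> (batch, frames, dim)
        return self.net(features)

encoder = PretrainedEncoder()
for p in encoder.parameters():
    p.requires_grad = False  # shared, "already trained" knowledge stays fixed

vocab_size = 128  # hypothetical token vocabulary for the new language
head = nn.Linear(256, vocab_size)  # small per-language output layer
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One training step on dummy log-mel features and token targets.
features = torch.randn(4, 100, 80)                # (batch, frames, mels)
targets = torch.randint(0, vocab_size, (4, 100))  # (batch, frames)
logits = head(encoder(features))                  # (batch, frames, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
optimizer.zero_grad()
loss.backward()   # gradients flow only into the small head
optimizer.step()
print(f"adaptation step done, loss = {loss.item():.3f}")
```

The point of this design is that only the small head needs labeled data in the new language; the expensive shared encoder is trained once and reused.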
Today’s situation
The most recent YouTube subtitle model achieved a 30% error rate across 73 languages. Yet with only 90,000 hours of training data, this model already outperforms Whisper, a general-purpose system that was trained on over 400,000 hours of data and has a 40% error rate.
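The percentages above are presumably word error rates (WER), the standard metric for speech recognition: the word-level edit distance between a system's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of that computation, on a hypothetical transcript pair (not Google's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical transcripts: one substitution in a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.1%}")  # -> WER = 16.7%
```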
It seems there is still much to do before the error rate falls below, say, 3% and the model can be used in practice. But Google has shown that it can effectively leverage its training process to adapt to new languages and data. It is now a matter of extensive further training, and after that, progress could be very rapid.