Meta’s New AI Transcribes and Translates Nearly 100 Languages

Meta recently introduced SeamlessM4T, a single AI model capable of speech and text translations in multiple languages.
 

What is SeamlessM4T?

 
Meta describes SeamlessM4T as the first all-in-one multilingual, multimodal AI model for translation and transcription. It covers a wide range of needs:

  • Speech recognition for almost 100 languages
  • Translating from speech to text in nearly 100 input and output languages
  • Converting speech-to-speech in nearly 100 input languages and 36 (including English) output languages
  • Text-to-text translation for nearly 100 languages
  • Translating text to speech in almost 100 input languages and 35 (including English) output languages
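The list above amounts to a small task matrix: each task has an input modality, an output modality, and a number of output languages. As a rough illustration only, it could be written down in Python as follows. The task codes, the `Task` class and the `tasks_for` helper are names chosen for this sketch rather than anything taken from Meta's release, and "nearly 100" is stored as `None` because the article gives no exact figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    name: str
    input_modality: str              # "speech" or "text"
    output_modality: str             # "speech" or "text"
    output_languages: Optional[int]  # None means "nearly 100" (no exact figure given)


# Hypothetical shorthand codes for the five capabilities listed above.
TASKS = {
    "ASR": Task("speech recognition", "speech", "text", None),
    "S2TT": Task("speech-to-text translation", "speech", "text", None),
    "S2ST": Task("speech-to-speech translation", "speech", "speech", 36),
    "T2TT": Task("text-to-text translation", "text", "text", None),
    "T2ST": Task("text-to-speech translation", "text", "speech", 35),
}


def tasks_for(input_modality: str, output_modality: str) -> List[str]:
    """Return the codes of all tasks matching the given input and output modalities."""
    return sorted(
        code
        for code, task in TASKS.items()
        if task.input_modality == input_modality
        and task.output_modality == output_modality
    )


if __name__ == "__main__":
    print(tasks_for("speech", "text"))
    print(tasks_for("text", "speech"))
```

Laid out this way, it is easy to see that only the two speech-output tasks are limited to a few dozen target languages, while every text-output task covers nearly 100.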

Meta’s aim is to help people communicate effortlessly through speech and text across different languages. The origins of SeamlessM4T might surprise you: they lie in literature.
 

From Babel Fish to Reality

 
SeamlessM4T takes its inspiration from fiction: the Babel Fish from “The Hitchhiker’s Guide to the Galaxy,” a creature that instantly translates any language for whoever carries it. In reality, Meta notes, “existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages.”

With this model, Meta aims to close that gap. What sets SeamlessM4T apart is its single-system approach, which they say “reduces errors and delays, increasing the efficiency and quality of the translation process.” The intended result is easier communication between people from different language backgrounds.

This isn’t Meta’s first venture into the world of translation and linguistic technology.
 

 

A Legacy of Advancements

 
Meta has a history of working towards a universal translator. Previously, they launched ‘No Language Left Behind’ (NLLB), a text-to-text machine translation model for 200 languages, which has since been integrated into Wikipedia.

They’ve also released Universal Speech Translator, the first direct speech-to-speech translation system for Hokkien, a language without a widely used writing system. In addition, they introduced Massively Multilingual Speech, which offers speech recognition and synthesis across more than 1,100 languages.
 

How Does it Work?

 

Understanding Speech:

 
Meta’s self-supervised speech encoder, known as w2v-BERT 2.0, learned to find structure and meaning in speech by analysing millions of hours of multilingual audio. It breaks down the audio signal and builds an internal representation of what is being said.
 

Processing Text:

 
The text encoder is based on the NLLB (No Language Left Behind) model, one of Meta’s previous releases. It’s trained to understand text in nearly 100 languages and to produce representations that are useful for translation.
 

Producing Text and Speech:

 
The team trained the text decoder to take encoded speech or text representations and turn them into translated text. For speech output, that text is first converted into discrete acoustic units, and a multilingual HiFi-GAN unit vocoder then converts these units into audio.
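To make the flow of data easier to follow, here is a deliberately simplified sketch in Python of how the pieces described in this section fit together. It is not Meta’s code: every function is a hypothetical stub that returns placeholder values, and the step that turns decoded text into units before the vocoder is an assumption inferred from the description above. Only the routing between components is meant to be illustrative.

```python
from typing import List, Union


def encode_speech(audio: List[float]) -> List[str]:
    """Stub for the speech encoder: one placeholder representation per audio sample."""
    return [f"speech_repr_{i}" for i, _ in enumerate(audio)]


def encode_text(text: str) -> List[str]:
    """Stub for the text encoder: one placeholder representation per word."""
    return [f"text_repr_{token}" for token in text.split()]


def decode_text(representations: List[str], target_lang: str) -> str:
    """Stub for the text decoder shared by both input modalities."""
    return f"<{target_lang}> translation of {len(representations)} encoded steps"


def text_to_units(text: str) -> List[int]:
    """Stub for the assumed text-to-unit step: maps characters to fake unit ids."""
    return [ord(ch) % 100 for ch in text]


def vocode(units: List[int]) -> List[float]:
    """Stub for the unit vocoder: maps unit ids to fake waveform samples."""
    return [unit / 100.0 for unit in units]


def translate(
    source: Union[str, List[float]], target_lang: str, want_speech: bool
) -> Union[str, List[float]]:
    """Route speech or text input through one shared pipeline."""
    # Pick the encoder based on the input modality.
    if isinstance(source, str):
        representations = encode_text(source)
    else:
        representations = encode_speech(source)

    # Both modalities feed the same text decoder.
    text = decode_text(representations, target_lang)
    if not want_speech:
        return text

    # Speech output: text -> units -> audio samples.
    return vocode(text_to_units(text))


if __name__ == "__main__":
    print(translate("hello world", "fra", want_speech=False))
    print(len(translate([0.1, 0.2, 0.3], "fra", want_speech=True)))
```

The point of the sketch is the shape of the pipeline: two encoders, one per input modality, feed a single decoder, and speech output is an optional extra stage rather than a separate system.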
 

Data Scaling and Results

 
SeamlessM4T relies on large datasets for training. By scaling up its data, Meta created SeamlessAlign, which they describe as the largest open speech-to-speech and speech-to-text parallel corpus to date. In total, SeamlessAlign contains more than 443,000 hours of speech.

When it comes to performance, Meta reports that SeamlessM4T sets a new standard. It achieves top-tier results for almost 100 languages and supports multiple tasks, all with a single model.
 

Safety First

 
Meta stresses that translation systems need to be accurate. They acknowledge the risk that the model could mistranscribe words or generate outputs that are toxic or incorrect.

In their words, “we conducted research on toxicity and bias to help us understand which areas of the model might be sensitive.” They have also implemented a toxicity classifier intended to detect and filter out harmful content, which should make the tool safer for users.
 

Sharing the Technology

 
True to its commitment to open science, Meta is sharing the model with the public. They present the release as part of a broader vision of bringing the world closer together by connecting people across languages.

In the words of the Meta team: “This is only the latest step in our ongoing effort to build AI-powered technology that helps connect people across languages.” If the model works as described, it should help people understand and be understood, irrespective of the language they speak.