Mistral: Voxtral TTS, Forge, Leanstral, & what's next for Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample
Latent Space: The AI Engineer Podcast
Mistral AI's release of Voxtral TTS, their first speech generation model, is the central focus, with Guillaume Lample and Pavan Kumar Reddy from Mistral detailing its architecture and capabilities. The model supports nine languages, is cost-effective, and uses a novel autoregressive flow matching architecture with a new neural audio codec. Pavan explains the differences between audio understanding and generation models, highlighting the use of latent tokens for converting audio. The discussion explores the potential of flow matching in audio, drawing parallels with image processing techniques, and addresses the challenges of real-time audio generation and evaluation. They also emphasize the importance of fine-tuning models with customer data to leverage domain-specific knowledge, and the company's commitment to open-source AI.
Part 1: Voxtral TTS, Architecture, Methods
00:05 Introducing Voxtral TTS: Mistral's New Speech Generation Model
Mistral is releasing Voxtral TTS, its first speech generation model, which supports nine languages. It follows the company's earlier audio model, Voxtral, a transcription model that has since been updated with more languages and features such as context biasing and real-time transcription. The new TTS model is small and fast, performing on par with larger models at a fraction of the cost. While the model will not be open-weighted this time, Pavan notes that it uses a novel in-house design that pairs autoregressive flow matching with a new neural audio codec.
02:26 Audio vs. Language Models: Novel Encodings and the Absence of a "Winner Model"
Pavan discusses the differences between audio understanding and audio generation models, explaining how audio is fed into the model. Audio understanding models use an audio encoder to produce continuous embeddings, similar to image models. The interesting part of audio generation is producing audio output using a neural audio codec that converts audio into latent tokens. Unlike text, there is no "winner model" in audio, making it an exciting space to explore.
07:25 Real-Time Audio Generation with Flow Matching for Voice Agents
The team chose an autoregressive approach for real-time streaming applications like voice agents. They opted to add audio as another head to their regular transformer decoder model for easier end-to-end modeling. Flow matching was chosen over discrete diffusion for its superior performance. Flow matching models the distribution of predictions at a particular time step, accommodating inflections and pronunciations. Disfluencies, such as "ums" and "ahs," contribute to the entropy that flow matching helps model.
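As a rough illustration of the flow matching idea mentioned above, the following is a minimal one-dimensional sketch, not Mistral's implementation. It assumes the common straight-line formulation, where a noise sample is interpolated toward a data sample and a network is trained to predict the velocity between them. The names `flow_matching_pair`, `euler_sample`, and `oracle_velocity` are hypothetical, and the "trained model" is replaced by a closed-form velocity field that is only valid for this toy case where the data is a single value.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Build one training example for straight-line flow matching.

    x0 is a noise sample, x1 is a data sample, t is in [0, 1].
    Returns the interpolated point x_t and the velocity a model would
    be trained to predict at (x_t, t).
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Assumption: the "data distribution" is a single value, so the ideal
# velocity field is known exactly and stands in for a trained network.
TARGET = 0.7

def oracle_velocity(x, t):
    return (TARGET - x) / (1.0 - t)

random.seed(0)
noise = random.gauss(0.0, 1.0)

# One training pair at t = 0.25.
x_t, v = flow_matching_pair(noise, TARGET, 0.25)

# Sampling: start from noise and follow the velocity field to the data.
sample = euler_sample(oracle_velocity, noise, num_steps=8)
print(sample)
```

In a real speech model the scalar would be a vector of audio latents and the velocity would come from a learned head conditioned on the decoder state; the sketch only shows the shape of the training target and of the sampling loop.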
12:33 Leveraging Vision Community Learnings for Efficient and Cost-Effective Audio Models
Guillaume notes that the vision community has done much more work than the audio community, and there are many learnings to apply to improve audio models further. Mistral aims to create efficient models tailored to specific use cases, offering cost-effective alternatives to general models. They've had many customers asking for voice, and Guillaume notes that even simple tasks like transcription are not yet as natural as human conversation, especially in languages other than English.
Part 2: Enterprise Solutions, Customization, Voice Cloning
15:06 The Future of Voice Agents and the Value of Fine-Tuning on Proprietary Data
Pavan reflects on the progress made in voice assistants, noting that despite advancements, there is still a gap between speaking to an agent and speaking to a person. Guillaume emphasizes that voice is the next step in making interactions with agents more natural and efficient. Mistral helps customers deploy models in-house, both to address privacy concerns and to let them use their proprietary data for fine-tuning. Fine-tuning on company-specific data leads to significant improvements because the model learns the company's own knowledge.
20:44 Tailored AI Solutions: Custom Models and Voice Personalization for Enterprises
Mistral offers tailored solutions for customers, including training new models with specific language mixes and dialects. They also build custom models for specific use cases, such as offline audio models for cars. Customers often come to Mistral after prototyping with closed-source models and running into high costs. Mistral provides open-weight models that customers can customize themselves. Pavan notes that voice personalization and adaptation are good use cases for fine-tuning, allowing enterprises to create customized voices that reflect their brand and safety considerations.
25:11 Voice Cloning and Long-Form Coherent Text-to-Speech Generation
Pavan states that the main use case for voice cloning is enterprise personalization, allowing companies to customize their voice assistants. He shares that a technical report will be released with details on the new text-to-speech model. The model processes audio at 12.5 Hz, which keeps sequences short enough to allow long context windows and coherent long-form generation. The techniques used are similar to long-context modeling for text, with the key difference being flow matching instead of text-driven prediction.
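To make the 12.5 Hz figure concrete, here is a small arithmetic sketch of how many audio frames a clip of a given length would occupy at that rate. The helper `frames_for` is hypothetical, and the sketch assumes one latent step per frame; the summary does not say how many tokens or codebook entries each frame carries.

```python
FRAME_RATE_HZ = 12.5  # frame rate quoted in the episode

def frames_for(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of audio frames needed to cover `seconds` of audio."""
    return round(seconds * frame_rate_hz)

one_minute = frames_for(60)    # 60 * 12.5 = 750 frames
ten_minutes = frames_for(600)  # 600 * 12.5 = 7500 frames
print(one_minute, ten_minutes)
```

At this rate, ten minutes of speech fits in a few thousand sequence positions, which is consistent with the claim that a low frame rate makes long, coherent generation more tractable.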
Part 3: Model Strategy, Open Source, Reasoning
28:49 Mistral Small: Merging Individual Capabilities into a Sparse Mixture of Experts
Guillaume explains that Mistral Small is a model that combines different models, merging capabilities like function coding. He notes that OpenAI may be moving away from the Omni model vision, but Mistral may pursue it. Guillaume states that it makes no sense to use a large model for simple tasks like transcription, and that separate models are more efficient. He also notes that there are many capabilities that people don't talk about, but that are important, such as computer-aided design.
32:24 Joining Voice with Video and Mistral's Commitment to Open Source
Swyx notes that joining voice with video is an important trend, especially considering spatial audio. Guillaume emphasizes that open source is core to Mistral's DNA, and that they have been open sourcing since the beginning. He notes that open source allows for faster progress and contributes to the ecosystem. Mistral wants to contribute to the open source community and does not want the best models to be only accessible to companies.
35:41 Leanstral and Formal Proving: Reasoning and Software Verification
Guillaume introduces Leanstral, a step in the direction of open source, and discusses the team's work on formal proving and math. He explains that formal proving is useful in software verification, and that the field is currently a small community, largely made up of people doing PhDs on the topic. Guillaume expects that with coding agents, the industry will be much larger in the future. Swyx suggests that because proofs take so long, they are a proxy for long-horizon reasoning, coherence, and planning.
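For readers unfamiliar with formal proving in software verification, the following is a tiny sketch of what a machine-checked statement about a program looks like. It assumes the Lean 4 proof assistant (the episode title's "Leanstral" suggests Lean, but the summary does not state which prover is used), and the function `double` and theorem name are hypothetical examples, not taken from the episode.

```lean
-- A small program: doubling a natural number.
def double (n : Nat) : Nat := n + n

-- A specification of that program, checked by the proof assistant:
-- for every input, `double n` equals `2 * n`.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

Real verification targets are far larger, and proofs can run to many steps, which is the sense in which proving serves as a test of long-horizon reasoning.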
Part 4: Research Frontiers, Hiring, Engineering Roles
40:02 Frontiers in Foundation Model Training: Pre-training and Long Trajectories
Vibhu asks about the frontiers of research in foundation model training. Guillaume states that they are still working a lot on the pre-training side, and that ML4 pre-training will be a big step up. He notes that they are building infrastructure that will anticipate extremely long scenarios on the up-priced data game. Swyx notes that the team is doing incredible work and has laid out an impressive vision for open source and voice.
42:26 Mistral is Hiring: AI for Science and Forward Deployed Engineers
Guillaume states that they are hiring many people for their science team, across all their offices. He notes that they are exploring AI for science, and that there are many areas where they think extremely promising results are possible. Guillaume states that a good forward deployed engineer needs to be familiar with the tech, know how to do fine-tuning, and know how to set up an RL pipeline.
45:14 The Diversity of Forward Deployed Engineering and the Full Circle System
Pavan notes that the diversity of the work that forward deployed engineers do always surprises him. Guillaume states that they are doing many things, and that their customers are always extremely careful about their data. Swyx notes that there can be many orders of magnitude more FDEs than research scientists. Pavan notes that the real cases are very diverse, and that the applied scientists and engineers improve a model for a given customer and then feed what they learn back into the base model itself.