AI/PROMPT ENGINEERING

microsoft/VibeVoice

Source: Simon Willison

April 28, 2026

EXECUTIVE SUMMARY

Unlocking Speech-to-Text: Microsoft’s VibeVoice Revolutionizes Audio Transcription

Summary

Microsoft's VibeVoice is a Whisper-style audio model for speech-to-text conversion with speaker diarization built into the model. It was released on January 21, 2026 under the MIT license.

Key Points

Product: VibeVoice, a Whisper-style audio model by Microsoft.
Release Date: January 21, 2026.
License: MIT licensed.
Model Size: The mlx-community/VibeVoice-ASR-4bit conversion takes 5.71GB, compared with 17.3GB for the original model.
Performance: Processes an hour of audio in approximately 8 minutes and 45 seconds on a 128GB M5 Max MacBook Pro.
Memory Usage: Reported peak memory usage was 30.44GB, although observed usage reached 61.5GB during prefill.
Audio Formats Supported: Works with both .wav and .mp3 files.
Token Limit: Default max-tokens is 8192, sufficient for about 25 minutes of audio.
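
The performance and token-limit figures above lend themselves to some back-of-the-envelope planning. The following is a minimal sketch, not part of VibeVoice or any of its tooling: the function names are hypothetical, and it assumes that both token usage and processing time scale linearly with audio length, which the source does not confirm.

```python
import math

# Figures quoted in the key points above.
DEFAULT_MAX_TOKENS = 8192          # default max-tokens
MINUTES_PER_DEFAULT_BUDGET = 25    # "about 25 minutes of audio"
BENCH_AUDIO_SECONDS = 60 * 60      # one hour of audio
BENCH_WALL_SECONDS = 8 * 60 + 45   # processed in 8 minutes 45 seconds


def estimate_max_tokens(audio_minutes: float) -> int:
    """Rough max-tokens needed for a recording (assumes linear scaling)."""
    return math.ceil(audio_minutes * DEFAULT_MAX_TOKENS / MINUTES_PER_DEFAULT_BUDGET)


def realtime_factor() -> float:
    """How many seconds of audio were processed per wall-clock second."""
    return BENCH_AUDIO_SECONDS / BENCH_WALL_SECONDS


def estimate_wall_seconds(audio_minutes: float) -> float:
    """Rough processing time on the benchmarked machine (assumes linear scaling)."""
    return audio_minutes * 60 / realtime_factor()


if __name__ == "__main__":
    print(f"Real-time factor: {realtime_factor():.2f}x")          # 6.86x
    print(f"Tokens for 60 min: {estimate_max_tokens(60)}")        # 19661
    print(f"Seconds for 60 min: {estimate_wall_seconds(60):.0f}")  # 525
```

Under these assumptions, an hour-long recording would need a max-tokens setting of roughly 19,700, well above the 8192 default. Treat the numbers as a starting point and verify them against real runs.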

Analysis

VibeVoice's distinguishing feature is that speaker diarization is built into the model itself, so transcripts of multi-speaker recordings can attribute each passage to a speaker without a separate diarization step. That makes it a practical option for podcasters, researchers, and businesses looking to automate transcription tasks.

Conclusion

IT professionals should consider evaluating VibeVoice for transcription workflows, especially those involving multiple speakers. When testing, budget for memory (usage of up to 61.5GB was observed during prefill) and expect to raise max-tokens above the default 8192 for recordings longer than about 25 minutes.