
ONE Sentinel

AI/PROMPT ENGINEERING

Gemma 4 audio with MLX

Source: Simon Willison
April 13, 2026
2 min read

EXECUTIVE SUMMARY

Transcribing Audio with Gemma 4 and MLX: A Practical Guide

Summary

This article provides a detailed guide on using the Gemma 4 E2B model with MLX for transcribing audio files on macOS. It includes a specific command recipe for implementation and shares results from a test transcription.

Key Points

  • The Gemma 4 E2B model is a 10.28 GB download that accepts audio input.
  • The transcription process utilizes the MLX framework and the mlx-vlm library.
  • A sample command for transcription is provided: `uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio mlx_vlm.generate --model google/gemma-4-e2b-it --audio file.wav --prompt "Transcribe this audio" --max-tokens 500 --temperature 1.0`.
  • A test on a 14-second .wav file produced a transcription containing several inaccuracies.
  • The misrecognized words in the output illustrate the current limits of on-device audio recognition.
  • Tags associated with the article include uv, mlx, ai, gemma, llms, speech-to-text, python, and generative-ai.
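Before running the transcription command above, it can help to sanity-check the input file (the article's test clip was a 14-second .wav). The sketch below is not from the article; it uses only Python's standard-library `wave` module, and the function names and the synthetic test tone are my own illustration:

```python
import math
import struct
import wave

def wav_duration_seconds(path):
    """Return the duration of a .wav file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def write_test_tone(path, seconds=1.0, rate=16000, freq=440.0):
    """Write a mono 16-bit sine-wave .wav file as a stand-in for real audio."""
    n = int(seconds * rate)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(n)
        )
        wf.writeframes(frames)

if __name__ == "__main__":
    write_test_tone("file.wav", seconds=1.0)
    print(f"{wav_duration_seconds('file.wav'):.1f}s")  # → 1.0s
```

A quick duration check like this catches truncated or mis-encoded files before spending time on a multi-gigabyte model run.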

Analysis

Pairing Gemma 4 with MLX, Apple's machine-learning framework for Apple silicon, makes local speech-to-text practical for developers and IT professionals who would rather not route audio through a cloud service. The article grounds this in a concrete, reproducible command recipe rather than abstract claims, and it is candid about both the model's capabilities and its limitations.

Conclusion

IT professionals should consider experimenting with the Gemma 4 E2B model and MLX for audio transcription tasks, while staying mindful of potential inaccuracies. Iterating on prompts and measuring output quality against reference transcripts may improve results in future applications.
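The "inaccuracies" the article observed can be quantified rather than eyeballed. A standard metric is word error rate (WER): edit distance between the reference and hypothesis transcripts at the word level, divided by the reference length. This stdlib-only sketch is my own illustration, not code from the article:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with classic edit-distance dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    print(word_error_rate("the cat sat down", "the bat sat down"))  # → 0.25
```

Tracking WER across prompt variants turns "refinement of prompts" into a measurable experiment instead of a subjective impression.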