Gemma 4 audio with MLX
Executive Summary
Transcribing Audio with Gemma 4 and MLX: A Practical Guide
Summary
This article provides a detailed guide on using the Gemma 4 E2B model with MLX for transcribing audio files on macOS. It includes a specific command recipe for implementation and shares results from a test transcription.
Key Points
- The Gemma 4 E2B model weighs 10.28 GB and supports audio input.
- The transcription process utilizes the MLX framework and the mlx-vlm library.
- A sample command for transcription is provided: `uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio mlx_vlm.generate --model google/gemma-4-e2b-it --audio file.wav --prompt "Transcribe this audio" --max-tokens 500 --temperature 1.0`.
- A test on a 14-second .wav file produced a transcript with several misheard words, illustrating the current limits of the model's audio recognition.
- Tags associated with the article include uv, mlx, ai, gemma, llms, speech-to-text, python, and generative-ai.
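The command recipe in the key points can be scripted rather than retyped. Below is a minimal sketch that builds the same `uv run` invocation with Python's `subprocess`; the flags and model ID are copied verbatim from the recipe above and not independently verified, and the `transcribe` helper is a hypothetical convenience, not part of mlx-vlm.

```python
import shlex
import subprocess

def build_transcribe_command(audio_path: str,
                             prompt: str = "Transcribe this audio",
                             max_tokens: int = 500,
                             temperature: float = 1.0) -> list[str]:
    """Build the argv for the transcription recipe (flags as quoted in the article)."""
    return [
        "uv", "run", "--python", "3.13",
        "--with", "mlx_vlm", "--with", "torchvision", "--with", "gradio",
        "mlx_vlm.generate",
        "--model", "google/gemma-4-e2b-it",
        "--audio", audio_path,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
    ]

def transcribe(audio_path: str, **kwargs) -> str:
    """Run the command and return its stdout.

    Note: the first run downloads the ~10 GB model, so this is slow and
    disk-hungry; it also assumes `uv` is installed and on PATH.
    """
    result = subprocess.run(build_transcribe_command(audio_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Print the command instead of running it, to avoid the model download.
    print(shlex.join(build_transcribe_command("file.wav")))
```

Wrapping the recipe this way makes it easy to vary the audio file or decoding parameters programmatically while keeping a single source of truth for the flags.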
Analysis
Running Gemma 4 with MLX, Apple's machine-learning framework for Apple silicon, gives developers and IT professionals a fully local speech-to-text option with no cloud API involved. The article emphasizes practical application: a concrete command recipe plus an honest look at where the transcription went wrong, showcasing both the capabilities and the limitations of on-device models.
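Because the test transcript contained misheard words, it helps to quantify accuracy rather than eyeball it. A standard metric is word error rate (WER): word-level edit distance divided by reference length. The sketch below is a plain dynamic-programming Levenshtein implementation, not something taken from the article:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word in a four-word reference -> 0.25
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Scoring a model's transcript of a short clip against a hand-written reference gives a repeatable number for comparing prompts, temperatures, or model sizes.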
Conclusion
IT professionals should consider experimenting with the Gemma 4 E2B model and MLX for audio transcription tasks, while checking output against known audio, since the test transcript contained errors. Iterating on prompts and sampling settings may improve transcription quality in future applications.
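One cheap way to act on that advice is to sweep a few prompt and temperature variants and compare the resulting transcripts. The sketch below only constructs candidate command lines from the recipe in the key points; the alternative prompt wording and the lower temperatures are illustrative assumptions, and the commands are not executed here (each run would require the full model).

```python
from itertools import product
import shlex

# Base invocation, reproduced from the article's recipe (not independently verified).
BASE = ("uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio "
        "mlx_vlm.generate --model google/gemma-4-e2b-it --audio file.wav")

# Illustrative prompt variants; lower temperatures generally make decoding
# more deterministic, which is usually desirable for transcription.
PROMPTS = [
    "Transcribe this audio",
    "Transcribe this audio verbatim, including punctuation.",
]
TEMPERATURES = [0.0, 0.3, 1.0]

def candidate_commands() -> list[str]:
    """Return one shell command per (prompt, temperature) combination."""
    return [
        f"{BASE} --prompt {shlex.quote(p)} --max-tokens 500 --temperature {t}"
        for p, t in product(PROMPTS, TEMPERATURES)
    ]

for cmd in candidate_commands():
    print(cmd)
```

Running each candidate against the same clip and scoring the transcripts against a reference makes the "refine the prompt" advice measurable instead of anecdotal.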