Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
EXECUTIVE SUMMARY
Unlocking Local LLMs: Apple's Innovative Approach to Efficient Model Inference
Summary
This article describes how Dan Woods ran the Qwen 3.5-397B-A17B model on a MacBook Pro M3 Max by applying techniques from Apple's paper "LLM in a Flash." The work demonstrates practical methods for running large language models (LLMs) locally even when the model far exceeds available RAM.
Key Points
- Dan Woods achieved 5.5+ tokens/second performance on a 48GB MacBook Pro M3 Max.
- The Qwen 3.5-397B-A17B model occupies 209GB on disk (120GB after quantization).
- The model uses a Mixture-of-Experts (MoE) architecture: only the experts activated for each token need to be resident in memory, so expert weights can be streamed from SSD on demand.
- Techniques from Apple's 2023 paper "LLM in a Flash" were utilized to optimize LLM inference with limited DRAM.
- The final configuration applied 2-bit quantization to the expert weights while keeping non-expert components (attention, embeddings, router) at original precision, for a total memory footprint of 5.5GB.
- The setup reduced the number of active experts per token from 10 to 4; quality evaluations suggested minimal degradation even at 2-bit quantization.
- The resulting code and a detailed PDF paper are available in the danveloper/flash-moe repository.
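The core mechanics from the list above can be sketched in a few lines. This is a hedged illustration, not code from the flash-moe repository: the expert count, hidden size, and the 2-bit encoding scheme are invented for the example. It shows the two key ideas in combination: expert weights are memory-mapped so pages are faulted in from SSD only when an expert is actually used, and each token is routed through only its top-4 experts, which are dequantized on the fly.

```python
import numpy as np

# Hypothetical sizes for illustration (not the actual Qwen dimensions).
NUM_EXPERTS = 64
HIDDEN = 128
TOP_K = 4  # reduced active-expert count, as in the article

# Simulate 2-bit quantized expert weights: each expert is a (HIDDEN, HIDDEN)
# matrix stored as codes in {0,1,2,3} plus a per-expert float scale.
rng = np.random.default_rng(0)
packed = rng.integers(0, 4, size=(NUM_EXPERTS, HIDDEN, HIDDEN), dtype=np.uint8)
scales = rng.random(NUM_EXPERTS).astype(np.float32)
np.save("/tmp/experts.npy", packed)

# mmap_mode="r" keeps the weights on disk; the OS pages in only the
# experts that are read, mirroring the stream-from-SSD idea.
experts_on_disk = np.load("/tmp/experts.npy", mmap_mode="r")

def dequantize(expert_id: int) -> np.ndarray:
    """Map 2-bit codes {0,1,2,3} to centered floats and apply the scale."""
    codes = np.asarray(experts_on_disk[expert_id], dtype=np.float32)
    return (codes - 1.5) * scales[expert_id]

def moe_layer(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts, loading each from disk."""
    top = np.argsort(router_logits)[-TOP_K:]      # indices of top-4 experts
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                          # softmax over selected experts
    out = np.zeros_like(x)
    for gate, eid in zip(gates, top):
        out += gate * (dequantize(eid) @ x)       # weighted expert outputs
    return out

x = rng.random(HIDDEN).astype(np.float32)
logits = rng.random(NUM_EXPERTS).astype(np.float32)
y = moe_layer(x, logits)
```

A real implementation would pack four 2-bit codes per byte and predict/prefetch experts ahead of the routing decision, but the memory behavior is the same: resident size is governed by the experts touched per token, not the full model.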
Analysis
This research showcases the potential for running complex AI models on consumer-grade hardware, making advanced AI capabilities more accessible. The techniques developed could significantly impact how organizations leverage LLMs in local environments, particularly for those with limited resources.
Conclusion
IT professionals should explore the implications of running LLMs locally using innovative memory management techniques. Adopting these methods could enhance their AI deployment strategies, particularly in resource-constrained settings.