Streaming experts
EXECUTIVE SUMMARY
Revolutionizing AI Efficiency: Streaming Experts Technique Gains Traction
Summary
The article covers streaming experts, a technique for running large Mixture-of-Experts (MoE) models efficiently on hardware with limited RAM by loading expert weights from SSD on demand. It highlights recent experiments by Dan Woods and others applying the technique across a range of devices.
Key Points
- Dan Woods is experimenting with streaming experts to optimize large AI models.
- The technique allows models to run on hardware with insufficient RAM by streaming expert weights from SSD.
- Recently, Dan ran the Qwen3.5-397B-A17B model in 48GB of RAM.
- User @seikixtc ran Kimi K2.5, a 1-trillion-parameter model with 32 billion active weights, in 96GB of RAM on an M2 Max MacBook Pro.
- Another user, @anemll, demonstrated the Qwen3.5-397B-A17B model running on an iPhone at 0.6 tokens/second.
- Ongoing autoresearch loops continue to search for further optimizations to the technique.
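The core idea behind the results above can be sketched in a few lines: keep only a small cache of expert weights resident in RAM and stream the rest in from SSD when the router selects them. This is a minimal illustrative sketch, not the implementation used in the experiments; the dimensions, cache size, and in-memory "store" standing in for SSD files are all assumptions.

```python
# Minimal sketch of "streaming experts": keep a small LRU cache of expert
# weights in RAM and fetch the rest from storage (SSD in practice) on demand.
# Sizes and the dict-based weight store are illustrative assumptions.
from collections import OrderedDict
import numpy as np

D, H = 64, 256          # toy hidden and FFN dimensions (assumed)
NUM_EXPERTS = 8
CACHE_SIZE = 2          # experts kept resident in RAM

rng = np.random.default_rng(0)
# Stand-in for per-expert weight files on SSD (real systems would use
# memory-mapped files or direct reads instead of an in-memory dict).
expert_store = {e: rng.standard_normal((D, H)).astype(np.float32)
                for e in range(NUM_EXPERTS)}

class ExpertCache:
    """LRU cache that streams expert weights in from the store on a miss."""
    def __init__(self, size):
        self.size = size
        self.cache = OrderedDict()
        self.misses = 0  # each miss would be an SSD read in a real system

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # mark as recently used
        else:
            self.misses += 1
            self.cache[expert_id] = expert_store[expert_id]
            if len(self.cache) > self.size:
                self.cache.popitem(last=False)   # evict least recently used
        return self.cache[expert_id]

def moe_forward(x, router_logits, cache, top_k=2):
    # Route the token to its top-k experts, streaming weights as needed.
    top = np.argsort(router_logits)[-top_k:]
    out = np.zeros(H, dtype=np.float32)
    for e in top:
        out += x @ cache.get(e)   # one expert matmul (FFN up-projection only)
    return out / top_k

cache = ExpertCache(CACHE_SIZE)
x = rng.standard_normal(D).astype(np.float32)
y = moe_forward(x, rng.standard_normal(NUM_EXPERTS), cache)
print(y.shape, cache.misses)
```

Because only `top_k` experts are active per token, peak RAM is bounded by the cache size rather than the full parameter count, which is why a 1-trillion-parameter model with 32 billion active weights can fit in 96GB.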
Analysis
This approach matters because it brings very large AI models within reach of consumer-grade hardware, making advanced AI more accessible. As demand for AI capabilities grows, techniques like streaming experts could improve resource utilization and open up broader applications.
Conclusion
IT professionals should explore the implications of streaming experts for deploying large AI models on limited hardware. Staying current with these developments can help optimize AI deployments across a range of applications.