SWE-bench February 2025 leaderboard update
EXECUTIVE SUMMARY
February 2025 SWE-bench Leaderboard Reveals Top AI Coding Models
Summary
The February 2025 update of the SWE-bench leaderboard evaluates AI coding models against a dataset of real-world coding problems. The update matters because it provides an independent assessment of model capabilities, in contrast to the self-reported results published by the labs themselves.
Key Points
- SWE-bench runs a "Bash Only" benchmark using the mini-SWE-agent harness, ~9,000 lines of Python code; a conceptual sketch of the agent loop appears after this list.
- The benchmark evaluates models against 2,294 coding problems sourced from 12 open-source repositories, including django/django (850 problems) and scikit-learn/scikit-learn (229 problems); the second sketch below shows one way to tally these per-repository counts.
- Claude Opus 4.5 ranks first, narrowly beating Opus 4.6 by about one percentage point.
- Other top performers include Gemini 3 Flash and MiniMax's M2.5, a 229B-parameter model.
- GPT-5.2 is OpenAI's highest-performing model, ranking 6th; GPT-5.3-Codex is not included in the results.
- Every model is run with the same system prompt for fair comparison, though this means the benchmark does not measure what vendor-specific harnesses or hand-optimized prompts could achieve.
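The "Bash Only" setup reduces the agent to a simple loop: the model sees the task and the output of its previous commands, proposes one bash command per turn, and repeats until it signals completion. The sketch below is a conceptual illustration of that loop, not mini-SWE-agent's actual code; the `query_model` helper is a hypothetical stand-in for whatever LLM client the harness uses.

```python
import subprocess

def query_model(messages):
    """Hypothetical stand-in for the LLM API client; returns the
    model's next reply given the conversation so far."""
    raise NotImplementedError

def run_bash_only_agent(task, max_steps=50):
    # The same system prompt is given to every model under test,
    # which is what keeps the leaderboard scores comparable.
    messages = [
        {"role": "system", "content": "Solve the issue. Reply with one bash command per turn, or SUBMIT when done."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "SUBMIT" in reply:  # the agent signals completion
            break
        # Run the proposed command and feed its output back to the model.
        result = subprocess.run(
            reply, shell=True, capture_output=True, text=True, timeout=120
        )
        messages.append({"role": "user", "content": result.stdout + result.stderr})
    return messages
```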
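The 2,294 problems come from the public SWE-bench dataset. As a quick sketch, assuming the Hugging Face `datasets` package and the `princeton-nlp/SWE-bench` dataset (whose test split stores each instance's source repository in a `repo` field), the per-repository counts quoted above can be tallied like this:

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Load the full SWE-bench test split (2,294 task instances).
ds = load_dataset("princeton-nlp/SWE-bench", split="test")

# Count problems per source repository, e.g. django/django: 850.
per_repo = Counter(row["repo"] for row in ds)
for repo, count in per_repo.most_common():
    print(f"{repo}: {count}")
```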
Analysis
The SWE-bench leaderboard is a valuable resource for IT professionals: because every model runs under the same harness and prompt, it offers an objective, like-for-like view of AI coding-model performance. This independent evaluation can guide decisions about adopting AI tools in software development.
Conclusion
IT professionals should draw on the SWE-bench leaderboard when choosing AI coding models, particularly when evaluating performance on real-world tasks rather than relying on self-reported benchmarks.