
AI/PROMPT ENGINEERING

SWE-bench February 2026 leaderboard update

Source: Simon Willison
February 19, 2026
2 min read

EXECUTIVE SUMMARY

February 2026 SWE-bench Leaderboard Reveals Top AI Coding Models

Summary

The February 2026 update of the SWE-bench leaderboard ranks AI coding models against a dataset of real-world coding problems drawn from open source repositories. The update matters because it offers an independent assessment of model capabilities, in contrast to the self-reported results published by the labs themselves.

Key Points

  • SWE-bench's "Bash Only" leaderboard runs every model through mini-swe-agent, a harness of roughly 9,000 lines of Python code.
  • The benchmark evaluates models against 2,294 coding problems sourced from 12 open source repositories, including django/django (850 problems) and scikit-learn/scikit-learn (229 problems); the sketch after this list shows one way to tally these counts yourself.
  • Claude Opus 4.5 ranks first, narrowly beating Opus 4.6 by about one percentage point.
  • Other top performers include Gemini 3 Flash and MiniMax M2.5, a 229B-parameter model from MiniMax.
  • GPT-5.2 is OpenAI's highest-performing model, ranking 6th; GPT-5.3-Codex does not appear in the results.
  • Every model runs with the same system prompt, which keeps comparisons fair but means the leaderboard does not capture gains from custom harnesses or model-specific prompt optimization.
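
If you want to sanity-check the dataset figures above, here is a minimal sketch, assuming the publicly hosted princeton-nlp/SWE-bench dataset on Hugging Face matches the problem set the leaderboard evaluates against:

    # Minimal sketch: tally SWE-bench problems per repository.
    # Assumes the public princeton-nlp/SWE-bench dataset on Hugging Face
    # (pip install datasets) reflects the leaderboard's problem set.
    from collections import Counter

    from datasets import load_dataset

    # The full SWE-bench test split holds the 2,294 problems cited above.
    swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

    # Count problems per source repository, e.g. django/django (850)
    # and scikit-learn/scikit-learn (229).
    per_repo = Counter(example["repo"] for example in swebench)

    for repo, count in per_repo.most_common():
        print(f"{repo}: {count}")

    print(f"Total problems: {sum(per_repo.values())}")

Running this downloads the full split on first use; the per-repo counts should let you verify the repository breakdown independently of the leaderboard page.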

Analysis

The SWE-bench leaderboard gives IT professionals an objective, independently run view of how AI coding models perform on real repository issues, a more reliable basis for adoption decisions than vendor-reported benchmark numbers.

Conclusion

IT professionals evaluating AI coding models should weigh independent results like the SWE-bench leaderboard alongside vendor claims, particularly when real-world coding performance is the deciding factor.