Proof in Production: Evaluating Effectiveness of SWE Agents with A/B Tests

SWE-Agents are setting new benchmark records, yet underperform in the real world. Beneath the hype, only A/B testing can prove their effectiveness.

Sep 24, 2025 · 5 min read · Originally on Substack ↗

AI Agents SWE A/B Testing Evaluation

We’re entering an era where Software Engineering (SWE) agents are beginning to automate parts of the software development lifecycle (SDLC). It isn’t trivial… it spans planning, analysis, design, development, testing, deployment, and maintenance. Not to mention the entire process involves many people, incomplete context, competing priorities, budget constraints and more.

The vision is bold. But a fundamental challenge remains: how do we know if these agents are truly effective? Today, most evaluations lean on static benchmarks. I argue that this is insufficient, and that A/B testing is the gold standard for evaluating SWE-Agents and de-risking its developments.

Tracking SWE-Bench scores over the last 2 years. It's the most popular benchmark used to evaluate SWE-Agents.

Limitations of Static Benchmarks

In the early days of SWE-Agents, benchmark scores demonstrated that parts of software engineering could be automated. As performance improved, the path to productization emerged: new methods are first evaluated on benchmarks—the rough equivalent of unit tests in engineering—then QA’ed internally before being shipped to production.

But once in production, the limitations of static benchmarks become clear:

Generalizability: Benchmarks typically cover narrow tasks such as bug fixing or unit test generation. Real SWE involves much more: debugging infrastructure, integrating CI/CD pipelines, refactoring large systems, etc.
Gamification: Optimizing agents to excel at specific benchmarks (e.g., SWE-Bench bug fixing) often leads to inflated scores without corresponding gains in other SWE tasks. Data contamination can also inflate scores similarly.
Interaction complexity: Multi-turn collaboration with humans, diverse environment setups, and heterogeneous codebases are hard to simulate. It’s tough to capture the variability and messiness of real-world contexts.
Practical trade-offs: Engineering decisions often involve short-term compromises, like temporarily creating technical debt or working under strict budget constraints, that benchmarks don’t reflect.
Fragility: Seemingly minor changes (e.g., a prompt tweak) may not shift benchmark scores but can drastically affect user experience. For example, an agent that becomes overly verbose may technically “pass” but frustrate developers in practice.

So why not build better benchmarks?

Efforts like VersaBench represent an important step by compiling a broader set of SWE tasks. But benchmarks still face inherent constraints:

They must remain relatively small and cost-effective, so evaluations can be repeated regularly. Attempting to model the full complexity of SWE would make them prohibitively expensive and still likely incomplete.
They cannot model user preferences. Subjective usability constraints often determine success, but are nearly impossible to encode in a static dataset.
The definitions for success differ and change. Benchmarks typically capture one slice of quality (e.g., correctness), while real-world engineering balances multiple goals (speed, maintainability, readability, etc).

The dilemma, however, is that you cannot confidently determine if an agent is production-ready until it’s already in production. The challenge becomes how do you manage risk? How do you confirm that improvements on benchmarks correlated to tangible improvements in production?

A/B Testing as Risk Management

If we agree that benchmarks are not the end-all-be-all and we’re looking to manage risk in production, the solution involves:

Benchmarks: The first safety net
Manual QA: Vibe check for obvious things benchmarks fail to capture
A/B Tests: The mechanism for detecting regressions and validating wins in live workflows

Designing A/B tests for SWE-Agents is non-trivial, but it is rooted in the scientific method, making it the gold standard for evaluating agents in the wild.

Why is this urgent?

In traditional product development, the product itself is (mostly) deterministic and the variability comes from users. This is why experimentation became essential—humans are notoriously unpredictable, and A/B testing helps separate good bets from costly mistakes.

With SWE-Agents, the risk compounds. Not only are users non-deterministic, but the agents themselves are too. Their behavior can vary run-to-run, even on identical inputs. On top of that, agents operate across a layered stack—prompts, tools, workflows, memory management—each of which can introduce variability. A small change may ripple into radically different agent behaviors. Therefore, experimentation is a way to manage risk and identify which changes in the agent stack actually improve performance.

Experimentation = Observability

Experimentation doesn’t just validate improvements, it doubles as a new kind of observability. For SWE-Agents, infrastructure observability does not mean product efficacy observability. A healthy and operational agent may not produce high quality code.

Luckily, another bonus of investing in experimentation is that tracking relevant metrics for test outcomes can also provide observability into product efficacy.

Infrastructure observability: System health (latency, token usage, etc)
Product efficacy observability: Agent is actually useful (minimally invasive user interactions, PR acceptance, etc)

If tomorrow a model provider aggressively quantized their model and served it on the same API, it may bubble up downstream via degraded product efficacy.

Towards Credible Impact

In the end, leaderboard wins don’t equal real-world impact. We’re trending towards flashy in benchmarks, and fragile with real users. Intentional strategy around experimentation can help improve agents, provide credible evidence for stakeholders, and prevent teams from deploying blind.

Thanks for reading!

Citation

@online{malhotrarohit2025blog,
  author = {Rohit Malhotra},
  title = {Proof in Production: Evaluating Effectiveness of SWE Agents with A/B Tests},
  year = {2025},
  month = {Sep},
  day = {24},
  url = {https://malhotra5.github.io/blog/proof-in-production/},
  urldate = {2026-06-10},
  note = {Blog post},
  organization = {Rohit Malhotra's Blog}
}