Why AI Agents Need a Real Eval and QA Toolset Before They Touch Production

The Promise and the Problem

The pitch for autonomous AI coding agents is compelling: give an agent a task, walk away, come back to working code. Teams are already doing this (shipping features, fixing bugs, refactoring legacy systems), with agents finishing in hours work that used to take days.

But speed without validation is just faster failure.

The uncomfortable truth emerging across engineering teams that have moved beyond toy demos is this: agents make confident mistakes. They produce code that compiles, passes surface-level checks, and completely violates the business logic it was supposed to implement. They introduce regressions that no unit test catches because no test was ever written for that edge case. They hallucinate API behaviour that does not exist. They satisfy the letter of a ticket while missing the spirit of the system entirely.

And they do all of this without any of the social friction that slows a human engineer down — no hesitation, no "I'm not sure about this one", no sense that something feels off.

The agent does not know what it does not know. That is your problem to solve.

What Makes Agent Output Different From Human Output

Before designing an eval and QA toolset, it is worth understanding why standard software testing is insufficient for agent-generated code.

Non-determinism at the source

Human engineers tend to make mistakes in recognisable patterns. Agent failures are sampled from the full probability distribution of the model's outputs. The same prompt, run twice, may produce subtly different code: one version correct, one version silently wrong. Your QA toolset must handle this stochastic reality, not assume reproducibility.
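One way to account for this is to evaluate a task over several runs and gate on the pass rate, not on a single sample. The sketch below is a minimal illustration: fake_agent and computes_sum are hypothetical stand-ins for a real agent call and a real acceptance check, and the 80% success figure is invented for the example.

```python
import random
from typing import Callable

Candidate = Callable[[int, int], int]


def pass_rate(generate: Callable[[random.Random], Candidate],
              check: Callable[[Candidate], bool],
              runs: int = 20,
              seed: int = 0) -> float:
    """Run a stochastic generator `runs` times; return the fraction of outputs passing `check`."""
    rng = random.Random(seed)
    passed = sum(1 for _ in range(runs) if check(generate(rng)))
    return passed / runs


def fake_agent(rng: random.Random) -> Candidate:
    # Hypothetical agent: usually returns a correct implementation,
    # sometimes a silently wrong one.
    if rng.random() < 0.8:
        return lambda a, b: a + b
    return lambda a, b: a - b


def computes_sum(candidate: Candidate) -> bool:
    # Acceptance check written independently of the agent.
    return candidate(2, 3) == 5
```

A gate might then require, say, pass_rate(fake_agent, computes_sum, runs=50) to stay above a threshold the team chooses, and treat anything lower as a flaky task that needs better context or tighter tests.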

Context blindness at scale

An agent working on a single file may not understand how that file fits into the broader system. It optimises locally. It does not know that the function it is rewriting is called by seventeen other services, two of which depend on its previous side effects. Human engineers build this map over years. Agents start fresh every session.

Business rule opacity

Business rules are the hardest thing to communicate to an agent — and the easiest thing for it to violate. "Premium users always get priority routing" is not in the codebase. It is in someone's head, or buried in a Confluence page from 2021, or implicit in a dozen conditional branches that grew organically over three years. The agent sees the code. It does not see the business.

Confident presentation of broken output

Agents do not express uncertainty the way humans do. A junior developer will say "I'm not sure this handles the edge case correctly." An agent will write confident, well-formatted code with a plausible explanation, all of it entirely wrong. Your eval layer cannot rely on the agent flagging its own failures.

The Five Layers of a Production Agent Eval Toolset

A genuine eval and QA toolset for agentic code generation is not a single tool. It is a layered system, each layer catching a different class of failure.

Layer 1: Business Rule Contracts

The foundation. Before any agent touches your codebase, your business rules must be encoded as testable assertions — not documentation, not comments, executable specifications.

This means:

Invariant tests that describe conditions that must always be true regardless of implementation ("a cancelled order must never appear in the active queue")
Property-based tests using tools like Hypothesis (Python) or fast-check (JavaScript) that probe the space of inputs systematically rather than sampling specific cases
Domain-specific test fixtures that represent real scenarios from your business — not just "create a user" but "create a premium user in the APAC region with an active enterprise trial who has exceeded their monthly usage quota"

These tests are not written by the agent. They are written by humans who understand the business and maintained as first-class artefacts. The agent's output is evaluated against them.

Tools: Hypothesis, fast-check, Pact (for API contracts), Cucumber/Gherkin (for business-readable specs), pytest-bdd.
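As an illustration of an invariant checked across many generated inputs, here is a minimal sketch of the "cancelled order must never appear in the active queue" rule. It uses the standard library's random module as a stand-in for a property-based testing tool such as Hypothesis, which would also shrink failing inputs. The Order type, the status values, and active_queue are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: int
    status: str  # assumed values: "open", "paid", "cancelled"


def active_queue(orders: list[Order]) -> list[Order]:
    """Implementation under test (hypothetical); the agent may rewrite it freely."""
    return [o for o in orders if o.status != "cancelled"]


def check_cancelled_never_active(trials: int = 200, seed: int = 0) -> None:
    """Property check: for random order lists, no cancelled order is ever queued."""
    rng = random.Random(seed)
    for _ in range(trials):
        orders = [Order(i, rng.choice(["open", "paid", "cancelled"]))
                  for i in range(rng.randint(0, 30))]
        queue = active_queue(orders)
        # The invariant itself: holds regardless of how active_queue is written.
        assert all(o.status != "cancelled" for o in queue), orders
        # Guard against the trivial "fix" of returning an empty queue.
        assert len(queue) == sum(o.status != "cancelled" for o in orders)
```

The second assertion matters: an invariant stated only as "never contains X" is satisfied by an implementation that returns nothing at all.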

Layer 2: Functional Evaluation

Unit and integration tests are necessary but not sufficient. Functional evaluation goes further — it asks whether the agent's code does what was asked for, not just whether it runs without errors.

This requires:

Golden output comparisons — for deterministic functions, what should this return given this input? Store the expected output and assert against it after every agent commit.
Behavioural snapshots — record how the system behaves before the agent touches it, then assert the agent's changes preserve that behaviour where expected and change it only where intended.
Regression suites tied to production incidents — every bug that has ever reached production becomes a test case that must pass before any future agent change is accepted.

Tools: Vitest, Jest, pytest, Playwright (for end-to-end behavioural checks), Storybook (for UI component behaviour).
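A golden output comparison can be sketched in a few lines. In this assumed design, golden files are JSON stored in a directory, and only an explicit update=True call (made by a human) may record or change them; a missing golden file is an error, not an invitation to create one. shipping_quote is a hypothetical deterministic function used for illustration.

```python
import json
from pathlib import Path


def check_golden(name: str, actual: object, golden_dir: Path, update: bool = False) -> None:
    """Compare `actual` with the stored golden JSON for `name`; record it only when update=True."""
    path = golden_dir / f"{name}.json"
    serialized = json.dumps(actual, indent=2, sort_keys=True)
    if update:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(serialized)
        return
    if not path.exists():
        raise FileNotFoundError(f"no golden file for {name!r}; a human must record one")
    expected = path.read_text()
    if serialized != expected:
        raise AssertionError(
            f"{name}: output differs from golden\nexpected: {expected}\nactual: {serialized}"
        )


def shipping_quote(weight_kg: float, premium: bool) -> dict:
    """Hypothetical deterministic function whose behaviour must be preserved."""
    base = 5.0 + 1.5 * weight_kg
    return {"price": round(base * (0.9 if premium else 1.0), 2), "priority": premium}
```

The same mechanism covers behavioural snapshots: record the outputs before the agent's change, then run the comparison after it and review every difference as either intended or a regression.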

Layer 3: LLM-Based Semantic Evaluation

Some failure modes cannot be caught by deterministic assertions — they require judgement. For these, you evaluate the agent's output with another LLM that is specifically prompted to act as a domain-aware reviewer.

This is not asking GPT-4 "is this good code?" It is asking a carefully prompted evaluator model:

"Does this implementation honour the business rule that [specific rule]?"
"Would this code handle the case where [domain-specific scenario]?"
"Is this API response format consistent with the existing contract as described in [context]?"

The evaluator LLM should be given the business context, the existing codebase structure, and specific rubrics — not left to make aesthetic judgements.

Tools: LangSmith (LLM evaluation pipelines), Braintrust (eval framework with scoring), DeepEval (assertion-based LLM testing), RAGAS (for retrieval-augmented scenarios), Promptfoo (prompt and output evaluation).
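The rubric-driven approach can be sketched as a prompt template plus a strict parser for the evaluator's reply. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, and call_model, which stands for whatever function sends a prompt to your evaluator model and returns its text. No specific vendor API is implied.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    rule: str
    passed: bool
    reason: str


PROMPT_TEMPLATE = """You are reviewing a code change against one business rule.
Business context:
{context}

Rule: {rule}

Code change:
{diff}

Reply with JSON only: {{"passed": true or false, "reason": "<one sentence>"}}"""


def evaluate_rule(rule: str, context: str, diff: str,
                  call_model: Callable[[str], str]) -> Verdict:
    """Ask the evaluator model about one specific rule and parse its verdict."""
    prompt = PROMPT_TEMPLATE.format(context=context, rule=rule, diff=diff)
    raw = call_model(prompt)
    try:
        data = json.loads(raw)
        passed = data["passed"] is True
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError):
        # An unparseable reply counts as a failure, never as a silent pass.
        passed, reason = False, f"unparseable evaluator reply: {raw[:80]!r}"
    return Verdict(rule, passed, reason)
```

Asking one rule per call keeps each verdict attributable to a named rule, which makes failures easier to log and feed back than a single overall score.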

Layer 4: Static Analysis and Security Gates

Agent-generated code must pass the same static analysis gates as human-written code — and in practice needs stricter gates because agents are more likely to introduce patterns that technically work but violate your security or style conventions.

Minimum requirements:

Type checking — TypeScript strict mode, mypy, or equivalent. Type errors in agent code are common and silent.
Linting — ESLint, Ruff, or equivalent configured for your actual codebase conventions, not defaults.
Security scanning — Semgrep rules tuned to your stack, Snyk for dependency vulnerabilities, Bandit for Python security anti-patterns. Agents frequently introduce SQL injection risks, insecure deserialization, and over-permissive CORS configurations.
Dependency auditing — agents sometimes introduce new dependencies that are outdated, vulnerable, or simply wrong for your licensing requirements.
Dead code and unreachable branch detection — agents often leave behind scaffolding and stub code that indicates incomplete implementation.

Tools: Semgrep, Snyk, SonarQube, ESLint, Ruff, mypy, TypeScript compiler, npm audit / pip-audit.

Layer 5: Runtime Observability and Rollback

Even the best pre-deployment evaluation will miss things. Production is the final test environment — which means your infrastructure must assume that agent-generated code will occasionally fail in production and be designed to catch and recover from it quickly.

This requires:

Feature flags on every agent-generated change, so you can disable specific functionality without a rollback deployment
Canary deployments that route a small percentage of production traffic to agent-changed code before full rollout
Real-time anomaly detection on business metrics, not just infrastructure metrics — a drop in checkout completion rate or a spike in payment failures is a business rule violation, not a server error
Automated rollback triggers that revert to the last known good deployment when key metrics breach thresholds
Distributed tracing so you can pinpoint exactly which agent change introduced which failure in which code path

Tools: LaunchDarkly, Flagsmith (feature flags), Datadog, Honeycomb, Grafana (observability), Argo Rollouts, Flagger (canary deployments).
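An automated rollback trigger on a business metric reduces to comparing the canary's rate against the baseline's. This is a minimal threshold sketch; MetricWindow, the 5% relative drop, and the 200-attempt minimum are illustrative assumptions, and a production trigger would more likely use a statistical test over a time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricWindow:
    successes: int  # e.g. completed checkouts
    attempts: int   # e.g. started checkouts

    @property
    def rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def should_roll_back(baseline: MetricWindow,
                     canary: MetricWindow,
                     max_relative_drop: float = 0.05,
                     min_attempts: int = 200) -> bool:
    """True when the canary's success rate is more than max_relative_drop below baseline."""
    if canary.attempts < min_attempts:
        return False  # not enough canary traffic to judge yet
    if baseline.rate == 0.0:
        return False  # nothing meaningful to compare against
    drop = (baseline.rate - canary.rate) / baseline.rate
    return drop > max_relative_drop
```

With a baseline of 9,000 completions out of 10,000 and a canary at 400 out of 500, the relative drop is about 11%, so the function returns True and the deployment controller would revert.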

Encoding Business Rules: The Hard Part

Every team that attempts to build agent eval tooling eventually hits the same wall: the business rules are not written down in a machine-readable form. They are distributed across product managers, engineers, customer success reps, and years of tribal knowledge.

This is not the agent's problem. It is a pre-existing debt that the agent economy is forcing you to pay.

Here is a practical approach to surfacing and encoding business rules:

1. Incident-driven rule capture. Every production incident becomes a rule. When agent-generated code causes a failure, write a test that would have caught it and write a human-readable rule statement that describes what the code was supposed to do. Build a living library of these.

2. Acceptance criteria as code. Require that every task given to an agent includes formal acceptance criteria written in a structure that maps to executable tests. Gherkin (Given/When/Then) is imperfect but is a proven format for bridging business language and test code.

3. Domain model tests. Write tests for your core domain model that are independent of any implementation. "A subscription in TRIAL status cannot generate an invoice" is a domain rule, not an implementation detail. Test the rule, not the code.

4. Contract tests between services. If your agents work on microservices, Pact-style consumer-driven contract testing ensures that agent changes to one service do not silently break the expectations of another. This is especially important because agents optimise locally and may not be aware of cross-service dependencies.
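To make the domain model approach concrete, here is a minimal sketch of the TRIAL invoice rule as a test. The Subscription class, the Status enum, and DomainRuleViolation are hypothetical names; the point is that the test states the rule through the public interface and stays valid however the internals are rewritten.

```python
from enum import Enum


class Status(Enum):
    TRIAL = "trial"
    ACTIVE = "active"


class DomainRuleViolation(Exception):
    """Raised when an operation would break a business rule."""


class Subscription:
    def __init__(self, status: Status, monthly_price: float) -> None:
        self.status = status
        self.monthly_price = monthly_price

    def generate_invoice(self) -> dict:
        # Hypothetical implementation; an agent may change everything below
        # as long as the rule test keeps passing.
        if self.status is Status.TRIAL:
            raise DomainRuleViolation("a TRIAL subscription cannot generate an invoice")
        return {"amount": self.monthly_price}


def test_trial_subscription_cannot_generate_invoice() -> None:
    """Rule: a subscription in TRIAL status cannot generate an invoice."""
    sub = Subscription(Status.TRIAL, 49.0)
    try:
        sub.generate_invoice()
    except DomainRuleViolation:
        return
    raise AssertionError("TRIAL subscription produced an invoice")
```

The test's docstring doubles as the human-readable rule statement, so the rule library and the test suite do not drift apart.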

Designing the Eval Pipeline

With the five layers in place, the eval pipeline for every agent-generated pull request looks like this:

Agent produces code change
         │
         ▼
  ┌─────────────────┐
  │ Static Analysis │  ← Linting, types, security (fast, seconds)
  └────────┬────────┘
           │ pass
           ▼
  ┌─────────────────┐
  │  Unit + Contract│  ← Business rule contracts, golden outputs (minutes)
  │     Tests       │
  └────────┬────────┘
           │ pass
           ▼
  ┌─────────────────┐
  │  LLM Semantic   │  ← Evaluator model checks domain correctness (minutes)
  │  Evaluation     │
  └────────┬────────┘
           │ pass
           ▼
  ┌─────────────────┐
  │  Integration +  │  ← End-to-end behavioural tests (10-30 minutes)
  │  E2E Tests      │
  └────────┬────────┘
           │ pass
           ▼
  ┌─────────────────┐
  │  Canary Deploy  │  ← 5% traffic, monitor business metrics (hours)
  └────────┬────────┘
           │ healthy
           ▼
      Full rollout

Each gate is a blocking step. A failure at any layer stops the pipeline and returns the change to the agent — or to a human — for correction. The agent should receive structured failure output from each gate, not just a pass/fail signal, so it can attempt self-correction before escalating to human review.
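The blocking behaviour and the structured failure output can be sketched as a small runner. The gate functions here are toy stand-ins that inspect a diff string, and the rule identifier ORD-1 is invented for the example; real gates would shell out to linters, test runners, and the evaluator model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    gate: str
    passed: bool
    failures: list[str] = field(default_factory=list)  # structured detail for the agent


Gate = Callable[[str], GateResult]  # a gate takes the change (here: a diff string)


def run_pipeline(change: str, gates: list[Gate]) -> list[GateResult]:
    """Run gates in order and stop at the first failure."""
    results: list[GateResult] = []
    for gate in gates:
        result = gate(change)
        results.append(result)
        if not result.passed:
            break  # blocking step: later, slower gates never run
    return results


def static_analysis(change: str) -> GateResult:
    # Toy check standing in for linting, typing, and security scans.
    failures = ["use of eval() is forbidden"] if "eval(" in change else []
    return GateResult("static_analysis", not failures, failures)


def contract_tests(change: str) -> GateResult:
    # Toy check standing in for the business rule test suite.
    failures = [] if "cancelled" in change else ["rule ORD-1: cancelled orders not filtered"]
    return GateResult("contract_tests", not failures, failures)
```

The last element of the returned list is the one to hand back to the agent: it names the gate and lists each failure, which is far more useful for self-correction than a bare exit code.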

The Eval Toolset Is Also a Feedback Loop

A well-designed eval pipeline does more than block bad code — it teaches the agent what good looks like.

When an agent fails an eval gate, that failure should be:

Logged with full context — what was the task, what did the agent produce, which rule did it violate, what was the exact failure message
Fed back to the agent as part of the next attempt — "your previous implementation failed because X; here is the specific failure and the rule it violated"
Aggregated across runs to identify systematic failure patterns — if the agent consistently fails business rule tests around payment processing, that is a signal to improve the rule documentation, the prompt context, or the agent's available tools

This transforms the eval toolset from a gate into a learning system. Over time, well-structured feedback loops produce agents that fail less often on your specific domain — because the failure modes are being systematically addressed.
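The three steps (log, feed back, aggregate) need little machinery to get started. The sketch below assumes a simple in-memory failure record; the field names, the feedback wording, and the threshold of three are illustrative choices, not a prescribed format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalFailure:
    task: str     # what the agent was asked to do
    rule: str     # the rule it violated
    area: str     # domain area, e.g. "payments"
    message: str  # exact failure output from the gate


def feedback_prompt(failure: EvalFailure) -> str:
    """Build the message prepended to the agent's next attempt."""
    return (
        f"Your previous implementation of '{failure.task}' failed.\n"
        f"Rule violated: {failure.rule}\n"
        f"Failure output: {failure.message}\n"
        "Revise the change so that this rule holds."
    )


def systematic_areas(log: list[EvalFailure], threshold: int = 3) -> list[str]:
    """Return domain areas with at least `threshold` failures, most frequent first."""
    counts = Counter(f.area for f in log)
    return [area for area, n in counts.most_common() if n >= threshold]
```

An area that keeps appearing in systematic_areas is the signal described above: the rules, the prompt context, or the agent's tools for that part of the domain need attention.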

What Teams Are Getting Wrong Today

Based on the patterns emerging across teams deploying agentic coding pipelines, here are the most common mistakes:

Relying on the agent to write its own tests. An agent that writes code and writes tests for that code will write tests that pass the code it wrote. This is not QA — it is circular validation. Tests for business rules must be written by humans who understand those rules.

Using "it compiles" as the success criterion. Code that compiles but violates business logic is worse than code that fails to build, because the violation is invisible to automated build systems and creates false confidence.

Skipping the LLM evaluation layer. Deterministic tests catch only the failures someone thought to assert. A test for "returns a list" passes even when the requirement was "returns a list sorted by the business priority ranking, with tied items in alphabetical order by customer name". Where no such assertion exists, only an evaluator with domain context can tell the difference.

No canary infrastructure. Teams that deploy agent code directly to 100% of production traffic with no rollback capability are one agent mistake away from a major incident. Canary deployment is not optional for agentic pipelines.

Business rules in prompts only. Rules described in the agent's system prompt are not enforced anywhere. They must be encoded as executable tests that run independently of the agent.

The Tooling Landscape (2025)

The eval and QA ecosystem for agentic systems is maturing fast. Key platforms worth evaluating:

LangSmith — Tracing, evaluation, and dataset management for LLM applications. Strong for tracking agent runs and building evaluation datasets from production traffic.

Braintrust — Structured eval framework with scoring functions, dataset versioning, and experiment tracking. Well-suited for iterative agent improvement.

DeepEval — Assertion-based testing framework specifically designed for LLM output evaluation, with built-in metrics for correctness, faithfulness, and relevance.

Promptfoo — Open-source prompt and model evaluation with support for custom test cases, red-teaming, and CI/CD integration.

Agenteval (AWS Labs) — Evaluation framework specifically for multi-step agent task completion, assessing whether the agent achieves the goal across a multi-turn trajectory.

Pydantic — While not an eval tool per se, Pydantic-enforced output schemas on every agent response are one of the highest-leverage interventions available. Structured output validation catches a surprising proportion of agent failures before they reach business logic at all.
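The idea of schema-enforced agent output can be shown without any dependency. The sketch below is a standard-library stand-in for what a Pydantic model would do more concisely; the AgentResponse fields are hypothetical and would be replaced by whatever structure your agent is asked to return.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResponse:
    summary: str
    files_changed: list[str]
    tests_added: bool


_EXPECTED_TYPES = {"summary": str, "files_changed": list, "tests_added": bool}


def parse_agent_response(raw: str) -> AgentResponse:
    """Parse and validate raw agent output; raise ValueError on any schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(data) != set(_EXPECTED_TYPES):
        odd = sorted(set(data) ^ set(_EXPECTED_TYPES))
        raise ValueError(f"missing or unexpected fields: {odd}")
    for name, expected_type in _EXPECTED_TYPES.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    if not all(isinstance(path, str) for path in data["files_changed"]):
        raise ValueError("files_changed must contain only strings")
    return AgentResponse(**data)
```

Rejecting malformed responses at this boundary means downstream gates only ever see well-formed input, so their failures point at the code change and not at the response format.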

Where to Start

If your team is deploying agents to production today without a proper eval layer, here is the minimum viable QA toolset to put in place this week:

Write five business rule invariant tests for the domain the agent is working in. Real rules, executable tests, no shortcuts.
Add static analysis to every agent PR — type checking and security scanning as blocking gates.
Set up LangSmith or Braintrust to log every agent run with its outcome — you cannot improve what you cannot measure.
Gate production on canary deployment — 5% traffic minimum before full rollout of any agent-generated change.
Create a failure log — every time agent code fails in production, add a test that would have caught it.

This is not a complete solution. It is a foundation. The complete solution is built iteratively, fuelled by real failure data from your specific domain.

The Stakes

The companies that figure out agent eval first will have a genuine competitive advantage — not because they are shipping faster, but because they are shipping reliably. Speed without reliability is noise. Speed with reliability is leverage.

The companies that skip this step are accumulating a different kind of technical debt: not in their codebase, but in their production systems — a slow accumulation of business rule violations, subtle regressions, and compounding failures that are increasingly hard to attribute to any single cause.

The agent is not responsible for the quality of its output. Your eval and QA toolset is.

Build it before you need it.
