Designing Calibrated AI Evals for Analytics Engineering Agents

Overview

This case study describes a generalized process for designing evaluation tasks for AI agents working inside analytics engineering projects.

The goal was to evaluate whether an agent could operate in a realistic development environment: read instructions, inspect an existing project, reason through dependencies, make code changes, and produce an output that satisfied objective validation criteria.

The central challenge was calibration. A useful evaluation task should be difficult enough to reveal meaningful agent limitations, but not so underspecified that failure becomes inevitable. The work therefore focused less on creating a single correct answer and more on designing a repeatable loop for task design, validation, and refinement.

What Makes Agent Evals Different?

Traditional evaluations often measure whether a model can answer a question correctly. Agentic evaluations are more complex because the system is expected to act inside an environment.

An agent may need to:

understand a goal from realistic but incomplete instructions,
explore an unfamiliar project structure,
infer relationships between files and transformations,
make changes without breaking existing behavior, and
validate its own work before submission.

These demands are especially pronounced in analytics engineering. The task is rarely just “write SQL.” The agent must understand data shape, transformation grain, naming conventions, dependency structure, and the difference between a plausible query and a maintainable project change.

Challenge

The challenge was to design tasks that produced meaningful signal.

A task that every capable agent solves yields little signal, and so does a task that every agent fails. The target zone sits between those extremes: the task should be solvable, but only when the agent follows the instructions carefully, inspects the project, and implements the required behavior correctly.
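One way to make the target zone concrete is to estimate a pass rate from repeated trial runs and compare it against a band. The sketch below is illustrative only: the function name and the 20% and 80% thresholds are assumptions, not values from this case study, and would need tuning for the agents being evaluated.

```python
def classify_difficulty(passes: int, trials: int,
                        low: float = 0.2, high: float = 0.8) -> str:
    """Place a task in a difficulty band based on its observed pass rate.

    The default thresholds are hypothetical placeholders.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")

    rate = passes / trials
    if rate < low:
        return "too_hard"      # almost every agent fails: little signal
    if rate > high:
        return "too_easy"      # almost every agent passes: little signal
    return "target_zone"       # solvable, but not trivially


print(classify_difficulty(passes=4, trials=10))
```

With only a handful of trial runs the estimated rate is noisy, so a band like this is better read as a prompt for review than as a hard gate.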

A manual process for creating these tasks does not scale well. It requires repeated work across task ideation, reference behavior, validation logic, execution checks, failure review, and instruction refinement. Small changes in wording or validation can dramatically change task difficulty.

The design process therefore needed to become more systematic.

Approach

The improved workflow used an AI-assisted evaluation design loop.

AI assistance was used to accelerate parts of the process such as idea generation, draft implementation, instruction refinement, and failure analysis. However, the evaluation judgment remained separate from generation. A generated task still needed to be checked for clarity, reproducibility, validation quality, and resistance to superficial solutions.

The workflow followed five broad stages:

Task ideation: generate candidate analytics engineering scenarios that resemble realistic project work.
Reference behavior: define what correct behavior should look like without exposing unnecessary implementation detail.
Instruction design: write task instructions that are clear enough to be fair, but not so prescriptive that they remove the reasoning challenge.
Validation design: create objective checks that distinguish correct, partial, and incorrect submissions.
Calibration and refinement: use observed failures to identify ambiguity, weak constraints, or overly easy paths through the task.
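The validation stage can be sketched as a set of named, objective checks whose combined results separate correct, partial, and incorrect submissions. Everything below is a hypothetical illustration: the check helpers, the `customer_id` and `revenue` column names, and the reference total are assumptions made for the example, not details of the tasks described here.

```python
from typing import Callable

Row = dict[str, object]
Check = Callable[[list[Row]], bool]


def has_columns(required: set[str]) -> Check:
    """Build a check that every row exposes the required column names."""
    return lambda rows: bool(rows) and all(required <= set(row) for row in rows)


def total_matches(column: str, expected: float, tol: float = 1e-6) -> Check:
    """Build a check that a numeric column sums to a reference total."""
    def check(rows: list[Row]) -> bool:
        try:
            return abs(sum(float(r[column]) for r in rows) - expected) <= tol
        except (KeyError, TypeError, ValueError):
            return False  # missing or non-numeric values fail the check
    return check


def grade(rows: list[Row], checks: dict[str, Check]) -> tuple[str, dict[str, bool]]:
    """Run every check and summarize the submission as one verdict."""
    results = {name: check(rows) for name, check in checks.items()}
    if all(results.values()):
        verdict = "correct"
    elif any(results.values()):
        verdict = "partial"
    else:
        verdict = "incorrect"
    return verdict, results


# Hypothetical submission: right columns, wrong revenue total.
checks = {
    "schema": has_columns({"customer_id", "revenue"}),
    "revenue_total": total_matches("revenue", 300.0),
}
submission = [
    {"customer_id": 1, "revenue": 100.0},
    {"customer_id": 2, "revenue": 150.0},
]
verdict, detail = grade(submission, checks)
print(verdict, detail)
```

Keeping the per-check results alongside the verdict makes later failure review easier, because a "partial" outcome shows which requirement was missed.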

The principle was simple: use AI to compress the design loop, but use validation and observed behavior to decide whether the task is actually useful.

System Design

The process was structured as a repeatable loop rather than a one-off writing exercise.

A candidate task began with a realistic analytics engineering scenario. The scenario was then converted into expected behavior, task instructions, and validation checks. Trial runs were used to observe how agents interpreted the task and where they failed.

Those failures became design signals. If agents consistently misunderstood the same requirement, the instruction was likely ambiguous. If agents passed without demonstrating the intended reasoning, the validation was likely too weak. If agents failed for reasons unrelated to the target skill, the task was likely overconstrained or noisy.

This made calibration an evidence-driven process. Instead of guessing whether a task was difficult, the workflow used agent behavior to refine the task.
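The three diagnostic rules above can be expressed as a small triage function over reviewed trial runs. This is a minimal sketch under stated assumptions: the `TrialRun` fields are hypothetical reviewer annotations, and the 50% `share` threshold is an arbitrary placeholder rather than a value from this work.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialRun:
    passed: bool
    used_intended_reasoning: bool = True   # reviewer judgment (hypothetical field)
    failure_reason: Optional[str] = None   # short reviewer label, e.g. "wrong_grain"
    on_target: bool = True                 # did the failure involve the target skill?


def diagnose(runs: list[TrialRun], share: float = 0.5) -> list[str]:
    """Map observed agent behavior to likely task-design problems."""
    signals: list[str] = []
    if not runs:
        return signals

    failures = [r for r in runs if not r.passed]
    passes = [r for r in runs if r.passed]

    # Many agents misreading the same requirement suggests an ambiguous instruction.
    reasons = Counter(r.failure_reason for r in failures
                      if r.on_target and r.failure_reason)
    if reasons:
        reason, count = reasons.most_common(1)[0]
        if count / len(runs) >= share:
            signals.append(f"ambiguous_instruction:{reason}")

    # Passing without the intended reasoning suggests validation is too weak.
    if any(not r.used_intended_reasoning for r in passes):
        signals.append("weak_validation")

    # Frequent failures unrelated to the target skill suggest a noisy task.
    off_target = [r for r in failures if not r.on_target]
    if len(off_target) / len(runs) >= share:
        signals.append("overconstrained_or_noisy")

    return signals


runs = [TrialRun(False, failure_reason="wrong_grain") for _ in range(3)]
runs.append(TrialRun(True, used_intended_reasoning=False))
print(diagnose(runs))
```

The labels still depend on a human or a separate reviewer reading each run, so the function organizes the evidence rather than replacing the judgment.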

Execution

The execution process began with candidate task generation.

The objective was to create realistic analytics engineering work rather than isolated SQL puzzles. Candidate tasks were reviewed for whether they required project navigation, dependency reasoning, and careful implementation.

Once a candidate looked promising, a reference behavior was defined. This gave the task a clear target while allowing the public-facing instruction to remain concise.

The first instruction draft was treated as a hypothesis, not a final specification. Trial runs were used to test whether the task communicated the right intent. When agents failed, the failures were inspected for patterns such as incorrect grain, missed constraints, inappropriate assumptions, or incomplete project inspection.
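Incorrect grain is one of the easier failure patterns to detect mechanically: if a model is meant to hold one row per key, any key that appears more than once is a violation. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `daily_orders` table; the table, column names, and data are invented for illustration.

```python
import sqlite3


def find_grain_violations(conn: sqlite3.Connection, table: str,
                          key_columns: list[str]) -> list[tuple]:
    """Return key combinations that occur more than once in the table.

    Identifiers are interpolated directly, so this assumes trusted table and
    column names supplied by the evaluation harness, not user input.
    """
    keys = ", ".join(f'"{c}"' for c in key_columns)
    query = (
        f'SELECT {keys}, COUNT(*) AS n FROM "{table}" '
        f"GROUP BY {keys} HAVING COUNT(*) > 1 ORDER BY {keys}"
    )
    return conn.execute(query).fetchall()


# Hypothetical output table that should have one row per (order_date, customer_id).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_orders (order_date TEXT, customer_id INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO daily_orders VALUES (?, ?, ?)",
    [
        ("2024-01-01", 1, 10.0),
        ("2024-01-01", 1, 5.0),   # duplicate key: breaks the expected grain
        ("2024-01-01", 2, 7.5),
    ],
)
violations = find_grain_violations(conn, "daily_orders", ["order_date", "customer_id"])
print(violations)
```

An empty result means the table matches the expected grain for those key columns; it does not by itself show that the values in each row are correct.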

The task was then revised. The refinement focused on removing accidental ambiguity while preserving the intended difficulty. Tasks that became too easy, too brittle, or too dependent on incidental project details were discarded rather than over-tuned.

This kept the process efficient. The goal was not to force every candidate into production quality, but to identify strong candidates quickly and reject weak ones before they consumed too much effort.

Tradeoffs

The AI-assisted workflow reduced manual effort, but it introduced several tradeoffs.

The first tradeoff was speed versus control. AI assistance made it faster to generate candidates and draft supporting material, but generated work still required review. Without careful validation, an evaluation task can appear realistic while testing the wrong behavior.

The second tradeoff was realism versus debuggability. Larger project environments produce more representative signal, but they also make failures harder to diagnose. The task needs enough complexity to resemble real work, but not so much complexity that the designer cannot explain why an agent passed or failed.

The third tradeoff was strictness versus fairness. Objective validation is valuable, but it can be unforgiving. A near-correct solution may still fail if it violates one important requirement. This makes instruction clarity and validation design tightly coupled.

The fourth tradeoff was calibration cost. Trial runs are useful because they reveal agent behavior, but every round of validation consumes time and compute. The workflow therefore favored early discard decisions over prolonged manual tuning.

Outcome

The workflow made evaluation design more systematic.

The most important improvement was not faster writing. It was the creation of a repeatable loop:

Generate a plausible task.
Define the intended behavior.
Build objective validation.
Observe agent behavior.
Revise ambiguous instructions or weak validation checks.
Keep, revise, or discard the task based on evidence.

This loop reduced reliance on intuition. Instead of manually guessing what would make a task useful, the process used observed agent failures as feedback.
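The final keep, revise, or discard step can be sketched as a small decision rule. The version below is an assumption-laden illustration: the parameter names, the pass-rate band, and the three-round budget are hypothetical and are not thresholds reported in this case study.

```python
def decide(pass_rate: float, rounds_used: int, validation_trusted: bool,
           max_rounds: int = 3, low: float = 0.2, high: float = 0.8) -> str:
    """Choose the next action for a candidate task after a calibration round.

    pass_rate:          fraction of trial runs that passed validation
    rounds_used:        calibration rounds already spent on this task
    validation_trusted: reviewer believes passes reflect the intended skill
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")

    in_band = low <= pass_rate <= high
    if in_band and validation_trusted:
        return "keep"
    if rounds_used >= max_rounds:
        return "discard"   # budget exhausted: stop tuning rather than over-fit
    return "revise"


print(decide(pass_rate=0.5, rounds_used=1, validation_trusted=True))
print(decide(pass_rate=1.0, rounds_used=3, validation_trusted=True))
```

Capping the number of rounds is one simple way to encode the preference for early discard over prolonged tuning.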

The result was a cleaner evaluation design process: faster candidate generation, more disciplined task review, and clearer decisions about when to refine or discard a task.

Key Takeaways

Good evals require calibration, not just correctness. A task is useful only if it lands in the right difficulty range for the agents being evaluated.
Agent failures are design signals. Failures reveal ambiguity, missing constraints, weak validation, and common reasoning gaps.
Validation must match the target skill. A task should test the behavior it claims to test, not incidental familiarity with a specific project structure.
AI assistance is useful, but not sufficient. LLMs can accelerate task ideation and drafting, but validation and calibration determine whether the task is usable.
Discarding weak candidates is part of the process. Some tasks are not worth rescuing. A disciplined discard rule keeps the evaluation pipeline efficient.
Realism needs boundaries. The best tasks resemble real work while remaining explainable, reproducible, and fair.