How you know it’s working: measuring AI

A team rolls out an AI tool for drafting customer support responses. Three months in, someone asks the manager, “Is it working?” The reply is, “The team likes it.” That’s not an answer. It ends the conversation without resolving anything, and it’s what people reach for when nobody set up a way to know.

AI is genuinely hard to measure. The outputs are fuzzy. The baselines are unclear. The benefits are often qualitative (“faster,” “less drudgery”) while the costs are concrete (subscription fees, time learning the tool, occasional mistakes that need correcting). This is what makes it different from measuring, say, a CRM rollout — you can’t fall back on “we ran X reports last quarter, now we run 2X.”

What follows is how to actually know.

Three layers to measure

Any AI use case can be measured at three layers. Each is progressively harder. Each answers a different question.

Output quality. Is the AI’s work good enough on its own? The most concrete layer. When the model produces a thing, is the thing right?

Process change. Is the work happening differently or faster? Regardless of whether each individual output is perfect, has the workflow changed in a way that delivers value?

Outcome impact. Are the business results better? The hardest question. Did the AI move a metric the business actually cares about?

Doing all three is rare. Doing none is common. Doing one well is much better than doing all three badly.

Layer 1 — Output quality

This layer asks whether the model’s work is right. The way to find out is to read it.

Pull fifty random outputs from the first month and rate them against the criteria that matter for the use case. Common criteria:

Complete — did it cover what was asked?
Accurate — are the facts and claims correct?
Appropriately toned — does it match the brand voice and context?
Usable as-is — could it be sent without editing?
Better than the alternative — is it better than a junior team member’s first draft?

Three methods are useful at this layer:

Spot checks. Random outputs, pulled regularly and read against the criteria. Patterns of failure surface this way — always wrong on dates, always too long, always too formal. Those patterns are what improve the prompt or the surrounding system.
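The "random" part of a spot check is worth making literal, since hand-picked samples tend to favour the memorable outputs. A minimal sketch, assuming each output has some identifier that can be listed; the function name `sample_for_review` and the default of 50 are illustrative choices, not part of any particular tool:

```python
import random

def sample_for_review(output_ids, k=50, seed=None):
    """Draw up to k distinct outputs uniformly at random for manual review.

    Passing a seed makes the draw reproducible, so a second reviewer
    can be handed exactly the same sample.
    """
    ids = list(output_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))
```

Recording the seed alongside the ratings is enough to show later which outputs were actually read.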

Side-by-side comparisons. Same task, one version done by AI, one by a person. A third party rates which is better, ideally without knowing which is which. The most useful method when the question is whether AI should be in this workflow at all.
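The blinding step is the part that is easiest to get wrong by hand. The sketch below is one possible way to do it, under assumptions: the names `blind_pairs` and `ai_win_rate` and the pair format are invented for the example. It randomises which side each version appears on, keeps the answer key away from the rater, and maps the rater’s picks back to their source afterwards.

```python
import random

def blind_pairs(pairs, seed=0):
    """Shuffle left/right placement of each (ai_text, human_text) pair.

    Returns (presented, key). `presented` is what the rater sees;
    `key` records which side held the AI version and stays hidden.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    presented, key = [], []
    for ai_text, human_text in pairs:
        ai_on_left = rng.random() < 0.5
        if ai_on_left:
            presented.append((ai_text, human_text))
        else:
            presented.append((human_text, ai_text))
        key.append("left" if ai_on_left else "right")
    return presented, key

def ai_win_rate(choices, key):
    """Fraction of decided comparisons where the rater picked the AI side.

    `choices` holds "left", "right", or "tie" per pair; ties are excluded.
    Returns None when every comparison was a tie.
    """
    decided = [(c, k) for c, k in zip(choices, key) if c != "tie"]
    if not decided:
        return None
    wins = sum(1 for c, k in decided if c == k)
    return wins / len(decided)
```

A win rate near 0.5 suggests the rater cannot tell the two apart on quality, which may itself be the finding.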

Rubric scoring. A defined rubric, each output scored 1–5 per criterion, tracked over time. The most useful method when prompts or models are changing and the question is whether each change is actually an improvement. (The three knobs covers which lever each change is pulling.)
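Tracking rubric scores over time mostly means averaging per criterion and keeping the versions apart. A rough sketch of that bookkeeping follows; the criterion identifiers are shorthand for the criteria listed earlier, and the version labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Shorthand identifiers for the criteria listed earlier (names are illustrative).
CRITERIA = ("complete", "accurate", "tone", "usable_as_is", "better_than_alternative")

def rubric_summary(scored_outputs, criteria=CRITERIA):
    """Average 1-5 score per criterion, grouped by version label.

    Each item in `scored_outputs` is (version, {criterion: score}).
    """
    by_version = defaultdict(lambda: defaultdict(list))
    for version, scores in scored_outputs:
        for criterion in criteria:
            score = scores[criterion]
            if not 1 <= score <= 5:
                raise ValueError(f"score out of range: {criterion}={score}")
            by_version[version][criterion].append(score)
    return {
        version: {c: round(mean(vals), 2) for c, vals in per_criterion.items()}
        for version, per_criterion in by_version.items()
    }
```

Comparing the summary for, say, `prompt_v1` against `prompt_v2` shows which criteria a change helped and which it hurt, which an overall average would hide.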

Frequency tends to be heavier at rollout (the system is unfamiliar, the failure modes haven’t been mapped), lighter at steady state (a handful of spot checks weekly), and heavier again after any change — new prompt, new model version, new vendor.

The single most useful artifact at this layer is a small, fixed evaluation set of 20–50 representative inputs, re-run every time something changes. This is the AI equivalent of a regression test. It doesn’t catch everything, but it catches obvious regressions and gives a comparable number across versions. Without it, every prompt or model change is made blind.
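In code, the regression-test analogy might look like the sketch below. Everything named here is an assumption for illustration: `generate` stands in for whatever calls the model, and `passes` for whatever decides an output is acceptable (a human rating entered by hand works as well as an automated check).

```python
def run_eval(eval_set, generate, passes):
    """Run every fixed input through `generate` and record pass/fail per case id.

    `eval_set` is a list of dicts with at least "id" and "input" keys.
    """
    return {
        case["id"]: bool(passes(case, generate(case["input"])))
        for case in eval_set
    }

def compare_runs(previous, current):
    """Compare two runs of the same evaluation set.

    Returns the current pass rate plus the case ids that regressed
    (passed before, fail now) and those that were fixed.
    """
    shared = previous.keys() & current.keys()
    regressions = sorted(i for i in shared if previous[i] and not current[i])
    fixes = sorted(i for i in shared if not previous[i] and current[i])
    pass_rate = sum(current.values()) / len(current) if current else 0.0
    return {"pass_rate": pass_rate, "regressions": regressions, "fixes": fixes}
```

The list of regressed case ids is usually more useful than the pass rate itself, because it points at the specific inputs to read.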

Layer 2 — Process change

Even when the individual outputs are the same quality as before, AI can change how the work happens. That’s often where the real value sits.

Time per task. How long the task took before the rollout, measured against how long it takes after. In language-shaped work the common reduction is 30–60%. For some tasks it’s closer to 80%. For some, closer to 0% — which is also a finding.

Output per period. Same team, more output produced. Same output, less time spent. Either signals real process change.

Volume handled per person. A support team handling 200 tickets per day with the same headcount where it used to handle 130 — that’s the AI doing useful work.

Reduced rework. AI first drafts that need fewer revisions than human first drafts are a process improvement before any time-saved calculation is even run.
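The arithmetic behind these measures is simple, but the two headline numbers are easy to confuse. A short worked example using the 130-to-200 ticket figures above, assuming headcount and working hours stayed the same (the function names are illustrative):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100

def implied_time_reduction(volume_before, volume_after):
    """Percent less time per unit implied by a volume change at constant hours."""
    if volume_after == 0:
        raise ValueError("volume_after must be non-zero")
    return (1 - volume_before / volume_after) * 100

volume_gain = percent_change(130, 200)        # tickets per day, from the example above
time_cut = implied_time_reduction(130, 200)   # assumes the same hours worked
print(f"volume per person: +{volume_gain:.1f}%")  # +53.8%
print(f"time per ticket:   -{time_cut:.1f}%")     # -35.0%
```

The same change reads as roughly 54% more volume or 35% less time per ticket, so it helps to say which one is being quoted.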

Three things distort this layer when read naively:

Hidden costs. A team that looks “30% faster” on the part of the workflow that’s measurable may be spending a quarter of that gain on the part that isn’t — prompting carefully, reviewing outputs, correcting mistakes. Honest process measurement covers the full cycle, not just the part the AI touches.
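Full-cycle accounting can be as small as one subtraction. In the sketch below the minute values are hypothetical, chosen to reproduce the case described: a task that looks 30% faster, with a quarter of the gain spent on prompting, review, and correction.

```python
def net_time_saving(baseline_minutes, ai_task_minutes, overhead_minutes):
    """Fractional time saved over the full cycle.

    `overhead_minutes` covers prompting, reviewing, and correcting,
    which a measurement of the AI-assisted step alone would miss.
    """
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    full_cycle = ai_task_minutes + overhead_minutes
    return (baseline_minutes - full_cycle) / baseline_minutes

# Hypothetical numbers: 20 minutes before, 14 with AI (30% faster on paper),
# plus 1.5 minutes of overhead (a quarter of the 6-minute gain).
apparent = net_time_saving(20, 14, 0)
actual = net_time_saving(20, 14, 1.5)
print(f"apparent saving: {apparent:.1%}, full-cycle saving: {actual:.1%}")
```

If the overhead reaches the size of the gain, the result is zero or negative, which is the “same or worse than before” case.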

Quality regressions disguised as speed. Faster outputs that are worse outputs are not an improvement. Time without quality is half the picture, and the missing half is usually the one that matters.

Metric gaming. A team that knows its “AI usage rate” is being tracked tends to use AI for the easy 80% and avoid it for the hard 20%, which is exactly backwards from where AI delivers the most value. Tracking usage as a primary metric quietly rewards the behaviour that produces the least benefit.

Layer 3 — Outcome impact

The hardest layer to measure honestly. Did the business actually change as a result of the AI being there?

Outcome metrics, by function:

Sales: revenue per rep, conversion rate, sales cycle length, leads worked per week
Support: customer satisfaction, first-response time, resolution rate, escalation rate
Marketing: campaign output volume, response rate, cost per qualified lead
Operations: cycle time, error rate, on-time delivery
Hiring: time to fill, candidate quality, screening throughput

Three things make this layer hard:

Confounding variables. Many things change at once. A new manager, a pricing shift, a competitor stumble, a product launch — any of them can move the same metric the AI is being held accountable for.
Outcome lag. Sales effects land in quarters. Customer satisfaction effects land in months. A six-week-old rollout whose metric hasn’t moved is either a failure or simply early, and from inside the moment those look identical.
Noise. Quarter-to-quarter variation in most business metrics is large. A real 5% improvement can be invisible inside the normal range of variation.
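One rough way to see the noise problem is to ask how far a post-rollout value sits from the baseline mean, measured in baseline standard deviations. This is a rule-of-thumb sketch, not a formal significance test: the threshold of 2 is an assumption, the sample values are hypothetical, and with only a handful of baseline periods the standard deviation is itself uncertain.

```python
from statistics import mean, stdev

def outside_normal_range(baseline_values, observed, threshold=2.0):
    """Return (z, flagged) for an observed value against baseline periods.

    z is the distance from the baseline mean in baseline standard
    deviations; `flagged` is True when |z| reaches the threshold.
    """
    if len(baseline_values) < 2:
        raise ValueError("need at least two baseline periods")
    mu = mean(baseline_values)
    sigma = stdev(baseline_values)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot scale the change")
    z = (observed - mu) / sigma
    return z, abs(z) >= threshold

# Hypothetical quarterly conversion rates (%), averaging 10.0.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5]
z, flagged = outside_normal_range(baseline, 10.5)  # a 5% relative improvement
print(f"z = {z:.2f}, outside normal range: {flagged}")
```

With this baseline a genuine 5% improvement lands well inside ordinary quarter-to-quarter variation, so a single post-rollout quarter cannot confirm it.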

The practical methods that still work:

One or two metrics, not ten. The “comprehensive scorecard” approach to outcomes dilutes the signal. Picking the one or two that matter most for the use case is more useful than tracking a dashboard nobody reads in full.

Baseline before rollout, for at least one or two cycles. No baseline, no comparison. Without a record of where things were before the AI arrived, any claim about what it changed is unfalsifiable in either direction.

Post-rollout against baseline, with the confounds named. If the metric moved, what else changed in the same period? If several things changed, the AI’s contribution cannot be cleanly isolated — and saying so honestly is more useful than a confident attribution that doesn’t hold up under scrutiny.

A/B test where feasible. Half the team uses AI, half doesn’t. Outcomes for both groups tracked over a quarter. This is the gold standard at this layer. It’s also rare, because most companies roll AI out to everyone at once and lose the comparison.

A/B testing AI in a business context takes discipline — split assignment, no spillover between groups, fair access to compute and training — but the signal is exceptionally strong when it works. A six-week A/B test of “AI in sales drafting” with a clear conversion metric produces more reliable information than six months of dashboard-watching.
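For a conversion-style metric, the comparison between the two groups can be checked with a standard two-proportion z-test. The sketch below uses only the standard library. It assumes reasonably large groups and independent observations, which is a simplification for leads worked by the same rep, and the counts in the example are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate (B minus A).

    conv_* are conversion counts and n_* are group sizes.
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical: control converts 40 of 400 leads, the AI group 60 of 400.
z, p = two_proportion_z_test(40, 400, 60, 400)
print(f"z = {z:.2f}, p = {p:.3f}")
```

For these made-up counts the result is a z of about 2.1 and a two-sided p-value of roughly 0.03. The statistic only means something if the split assignment and no-spillover conditions held.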

What looks measurable but isn’t

Several things present themselves as measurement and aren’t.

“AI usage.” How often the AI is being called says nothing about whether the work is better. A team can use AI heavily and produce nothing of value. Vendor dashboards full of “AI activity metrics” are usually the worst kind of measurement — they make leadership feel informed without informing them.

“Team satisfaction with AI.” Worth knowing, but a team can genuinely enjoy a tool that isn’t helping them produce more or better work. Equally, a team can resent a tool that is measurably making them faster. Satisfaction is a signal, not the answer.

“Token volume” or “request count.” Measures of cost, not value. Useful inside an infrastructure conversation, misleading inside an impact conversation.

“Number of features used.” Having more AI capabilities in the stack doesn’t mean more value delivered. This is a vendor-side metric that has wandered into customer dashboards.

Vanity dashboards. Any dashboard where the metrics look impressive but none of them trace to a business outcome is measuring engagement, not impact. They tend to grow when measurement is delegated and nobody is held to outcomes.

A minimum responsible measurement setup

For any AI tool in production, the floor is roughly:

A small evaluation set of 20–50 representative inputs, re-run regularly, results tracked over time.
A rough before/after time-per-task or output-per-period measurement. Back-of-a-napkin is acceptable; absent is not.
Honest tracking of one or two outcome metrics that matter for the use case, with a baseline from before rollout.
A named human reviewer for the worst-case outputs. The rare bad outputs matter more than the average good ones, and someone has to be watching. (Building trust covers the review patterns at each blast-radius level.)

A sophisticated dashboard isn’t required. A spreadsheet updated monthly is more than most companies do. The discipline of actually doing the measurement matters more than the sophistication of the system around it.

When a project isn’t working

The hardest measurement decision is the one to stop. AI delivers in some use cases and not others — and the gap between “we haven’t found the angle yet” and “this isn’t the right fit” is where many AI projects quietly stall.

The honest signals that it isn’t working in a given case:

After two to three months, no clear time or output improvement at the team level.
The team has quietly stopped using it. (This often happens without anyone saying so explicitly. Usage data is a poor measure of value, but it is the only reliable read on abandonment.)
Quality issues require so much review that net time is the same or worse than before.
Outcome metrics haven’t moved beyond noise.
The cost is real and recurring. The benefit, asked plainly, is “vibes.”

At that point, the responsible move is to stop or rescope. Sunk-cost reasoning — “we’ve invested so much already” — isn’t an argument. The next month of subscription is the only money still on the table; the previous six are gone either way.

AI projects routinely get carried past the point of useful learning and into sustained drag. The discipline of saying “this isn’t working in its current shape” is rarer than it should be, and worth practising before it’s needed.

Baseline before, not after

The single most valuable measurement habit is to write down the baseline before rollout. Not after.

“The team currently handles 130 tickets per day with average resolution time of 18 minutes and CSAT of 4.2” — written down before the AI arrives — is worth more than any dashboard built later. Once the rollout happens, the past is gone. There is no honest way to reconstruct what the workflow used to be.
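“Written down” can be a line in a spreadsheet. For teams that prefer a file, the sketch below shows one possible shape, using the figures from the example above. The field names, the date, and the post-rollout values are placeholders, not a recommended schema.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Baseline:
    recorded_on: str               # ISO date the snapshot was taken
    tickets_per_day: float
    avg_resolution_minutes: float
    csat: float

def to_json(baseline):
    """Serialise the snapshot so it can be stored before rollout."""
    return json.dumps(asdict(baseline), indent=2, sort_keys=True)

def from_json(text):
    """Load a stored snapshot back into a Baseline."""
    return Baseline(**json.loads(text))

def change_vs_baseline(baseline, current):
    """Percent change per metric; `current` maps metric name to latest value."""
    before = asdict(baseline)
    return {
        name: round((value - before[name]) / before[name] * 100, 1)
        for name, value in current.items()
    }

# Placeholder date; metric values taken from the example above.
snapshot = Baseline("2024-01-15", tickets_per_day=130,
                    avg_resolution_minutes=18, csat=4.2)
stored = to_json(snapshot)

# Later, with hypothetical post-rollout numbers:
print(change_vs_baseline(from_json(stored),
                         {"tickets_per_day": 200, "avg_resolution_minutes": 15}))
```

The frozen dataclass is a small nudge against editing the baseline after the fact.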

Five minutes of pre-rollout measurement settles a lot of six-month debates about whether AI is actually working. The debate becomes a comparison instead of an argument.