What to watch for

A prototype that worked beautifully in a demo enters production. The model gets called a thousand times a day instead of ten. Real customer data starts flowing through it. Someone updates the model version. Six months later, four categories of risk have shown up that weren’t visible at launch.

None of them are reasons not to ship AI. They’re shapes the production system has to be designed around — and they look very different from the risks of a traditional software rollout.

API-based AI is metered: per request, per token, per model. Costs that were rounding errors in the prototype can balloon in production, often without anyone noticing until the invoice arrives.

The main drivers:

  • Volume. A tool that runs 10x more often than expected costs 10x more. Production traffic is reliably higher than pilot traffic.
  • Model choice. Frontier models can be 10–50x more expensive than workhorse models. Most of the time, the workhorse is fine — the frontier model is being used out of caution, not necessity.
  • Prompt size. Input cost scales with token count, so a 10,000-token prompt costs roughly 5x as much to send as a 2,000-token one. Long-form context is convenient and expensive.
  • Long-context patterns. Stuffing entire documents into every request — instead of retrieving only the relevant parts (see tools and memory) — multiplies cost across every call.
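How these drivers compound can be sketched with a little arithmetic. The prices below are hypothetical placeholders, not any provider's real rates, and the model names "workhorse" and "frontier" are labels invented for this illustration; actual pricing differs by provider and tier.

```python
# Hypothetical prices in USD per million tokens. Real prices vary by
# provider, model, and tier; these exist only to make the arithmetic concrete.
PRICES_PER_MILLION = {
    "workhorse": {"input": 1.00, "output": 4.00},
    "frontier": {"input": 15.00, "output": 60.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of a single request."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 calls_per_day: int, days: int = 30) -> float:
    """Estimated monthly cost in USD at a steady daily call volume."""
    return call_cost(model, input_tokens, output_tokens) * calls_per_day * days


if __name__ == "__main__":
    # Prototype: small prompt, cheap model, ten calls a day.
    print(f"{monthly_cost('workhorse', 2_000, 500, 10):.2f}")
    # Production: long prompt, expensive model, a thousand calls a day.
    print(f"{monthly_cost('frontier', 10_000, 500, 1_000):.2f}")
```

Under these assumed prices, the prototype costs about $1.20 a month and the production setup about $5,400: volume, model choice, and prompt size multiply each other rather than adding up.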

What sensible cost design looks like in production: cost is measured per use, not just as a monthly total. “$10,000 a month” means nothing without “across how many uses.” Stable parts of prompts get cached where the provider supports it. Workhorse models do the work that doesn’t need a frontier model — classification, routing, simple extraction. Hard ceilings and spend alerts catch anomalies before they become invoices.
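A minimal sketch of per-use measurement, a spend alert, and a hard ceiling might look like the following. The class name, the 80% alert threshold, and the task labels are assumptions made for illustration; a real system would usually lean on the provider's billing data and budget controls as well.

```python
from dataclasses import dataclass

# Tasks assumed not to need a frontier model (illustrative list).
WORKHORSE_TASKS = {"classification", "routing", "extraction"}


def choose_model(task: str) -> str:
    """Send simple tasks to the cheaper model, everything else to the frontier one."""
    return "workhorse" if task in WORKHORSE_TASKS else "frontier"


@dataclass
class SpendTracker:
    """Tracks spend per use and reports when thresholds are crossed."""
    monthly_ceiling_usd: float
    alert_fraction: float = 0.8  # assumed threshold: warn at 80% of the ceiling
    total_usd: float = 0.0
    uses: int = 0

    def record(self, cost_usd: float) -> str:
        """Record one use and return 'ok', 'alert', or 'blocked'."""
        self.total_usd += cost_usd
        self.uses += 1
        if self.total_usd >= self.monthly_ceiling_usd:
            return "blocked"  # hard ceiling: the caller should stop sending requests
        if self.total_usd >= self.alert_fraction * self.monthly_ceiling_usd:
            return "alert"  # spend alert: notify someone before the invoice does
        return "ok"

    @property
    def cost_per_use(self) -> float:
        """Average cost per use, the figure a monthly total hides."""
        return self.total_usd / self.uses if self.uses else 0.0
```

The caller decides what "blocked" means in practice: rejecting requests, queueing them, or degrading to a cheaper model.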

Data flows through the model. Where the model lives, what gets logged, and how that data is handled are things to know explicitly, not assume. (“Where your data goes” lays out the four surfaces in full.)

The dimensions that determine the security posture of any AI in production:

  • Where the model is hosted. Lab-hosted (the model provider’s servers), customer cloud (e.g. Amazon Bedrock, Azure OpenAI), or on-premises. Each has different data-residency and access implications.
  • What gets logged. Whether inputs are stored, for how long, and who inside the provider can see them. This varies by tier and contract.
  • Training use. Whether inputs may be used to improve future models. Most enterprise tiers explicitly exclude this; consumer tiers historically have not.
  • Encryption. In transit and at rest, both of which most enterprise contracts now cover by default.
  • Regulated data. For health, finance, or government data, whether the setup meets the relevant compliance regime (HIPAA, SOC 2, PCI DSS, sector-specific frameworks).

The most common mistake here is using consumer tiers (free ChatGPT, free Claude.ai) for sensitive operational data. The free tiers are convenient but not designed for that purpose. Enterprise tiers usually are, but the specifics depend on the vendor, the tier, and the contract, not on assumption.

Once AI is broadly available inside a company, decisions get made on its output. Some of those decisions are low-stakes (rewriting an internal note). Some are high-stakes (the response a customer sees, the candidate that gets rejected). Without a governance layer, the line between the two gets walked over.

The minimum governance surface looks roughly like this:

  • Permission map. Which tools are approved for which use cases. Default-allow for low-risk work, default-deny for high-risk work. The boundary is explicit.
  • Review structure. For anything customer-facing or decision-affecting, a named reviewer. “AI handles the draft, the team handles the verdict” only works if the verdict has an owner. (Building trust covers the review patterns at each level of blast radius.)
  • Auditability. When a customer questions a response or a regulator asks how a decision was made, the system can reconstruct what model was called, with what prompt, against what data.
  • Confidential data boundaries. What goes into a model and what doesn’t. Customer PII, internal IP, salary information, M&A material — most of this lives on the don’t-input list, but it has to be a written list, not an assumed one.
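The permission map and the audit trail above can be sketched in a few lines. Every use-case name, risk tier, and reviewer role below is hypothetical, and the record fields are one plausible shape rather than a standard; the audit log itself holds prompts, so it needs the same data-handling care as the inputs.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

# Hypothetical use cases mapped to risk tiers.
RISK_TIER: Dict[str, str] = {
    "rewrite_internal_note": "low",
    "customer_reply": "high",
    "candidate_screening": "high",
}

# High-risk use cases that have been approved, each with a named reviewer.
APPROVED_HIGH_RISK: Dict[str, str] = {"customer_reply": "support-lead"}


def check_permission(use_case: str) -> Tuple[bool, Optional[str]]:
    """Default-allow for low-risk work, default-deny for high-risk work.

    Returns (allowed, reviewer). Unknown use cases are treated as high-risk.
    """
    tier = RISK_TIER.get(use_case, "high")
    if tier == "low":
        return True, None
    reviewer = APPROVED_HIGH_RISK.get(use_case)
    return reviewer is not None, reviewer


def audit_record(use_case: str, model: str, prompt: str,
                 data_refs: List[str], reviewer: Optional[str]) -> dict:
    """Capture what model was called, with what prompt, against what data."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "use_case": use_case,
        "model": model,
        "prompt": prompt,
        "data_refs": list(data_refs),  # identifiers of the records used, not the data itself
        "reviewer": reviewer,
    }
```

Treating unknown use cases as high-risk keeps the boundary explicit: nothing becomes allowed by omission.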

This doesn’t need to be a forty-page document on day one. A one-pager the team has actually read is worth more than a comprehensive policy that lives in a folder nobody opens.

Models update. Behaviours change. A prompt that produced the right output in May may produce subtly different output in November — same words, different result.

The forms drift takes:

  • Silent updates. Providers of closed models can update the model behind an unpinned name while traffic is live. The name doesn’t change; the behaviour does.
  • New defaults. A provider releases a new model and rolls existing users onto it, often with slightly different tone, refusal patterns, or formatting.
  • Retirements. Older model versions get sunset. API calls to retired endpoints start failing, sometimes with ample notice and sometimes with very little.
  • Style shifts. Even within a stable version, response style can drift subtly through provider-side fine-tuning.

What production systems do about it: pin model versions where the provider supports pinning, and don’t auto-upgrade. Run the evaluation set whenever the model version is intentionally changed, to catch regressions before they reach users. Wire in a fallback model so retirements don’t cause downtime. Treat prompts as living artefacts that need re-evaluation on a regular cadence — quarterly is a reasonable default for production work. (Prompting is one of the three knobs; a drift event often calls for a small turn of that knob, not a panic.)
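As a rough sketch of pinning, fallback, and evaluation-on-change: the model identifiers, the `ModelRetiredError` exception, and the `call` function signature below are all placeholders invented for illustration, standing in for whatever client and error types a given provider actually exposes.

```python
from typing import Callable, List, Tuple

# Hypothetical pinned identifiers; real ones come from the provider's model list.
PINNED_MODEL = "example-model-2025-01-15"
FALLBACK_MODEL = "example-fallback-2024-10-01"

# Placeholder for a provider client: (model_id, prompt) -> output text.
ModelCall = Callable[[str, str], str]
# One evaluation case: a prompt plus a check on the output.
EvalCase = Tuple[str, Callable[[str], bool]]


class ModelRetiredError(Exception):
    """Stand-in for the error a provider returns for a retired model."""


def call_with_fallback(call: ModelCall, prompt: str) -> Tuple[str, str]:
    """Call the pinned model; switch to the fallback if it has been retired.

    Returns (model_used, output) so the audit trail records which one answered.
    """
    try:
        return PINNED_MODEL, call(PINNED_MODEL, prompt)
    except ModelRetiredError:
        return FALLBACK_MODEL, call(FALLBACK_MODEL, prompt)


def run_eval(call: ModelCall, model_id: str, cases: List[EvalCase]) -> float:
    """Return the fraction of evaluation cases whose output passes its check."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for prompt, check in cases if check(call(model_id, prompt)))
    return passed / len(cases)


def safe_to_switch(call: ModelCall, candidate: str, cases: List[EvalCase],
                   min_pass_rate: float = 0.95) -> bool:
    """Gate an intentional version change on the evaluation set (threshold is an assumption)."""
    return run_eval(call, candidate, cases) >= min_pass_rate
```

The same `run_eval` function can be run on the quarterly cadence against the currently pinned model, which is how drift inside a nominally stable version would surface.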

The floor for any AI in production:

  • Cost monitoring with per-use measurement and spend alerts.
  • A reviewed data-handling agreement that matches the sensitivity of the data flowing through.
  • A named human reviewer for high-stakes output.
  • A pinned model version with a written re-evaluation cadence.
  • A one-page usage policy the team has read.

None of this is expensive to set up. Skipping it is the expensive option — paid in surprise invoices, leaked data, unreviewable decisions, or quiet drift that nobody catches until something visible breaks.