Agents — when AI takes action

The frontier of AI right now is agents — systems where the model doesn’t just answer one question, but takes multiple steps, uses tools, and works toward a goal with minimal hand-holding.

The basic loop

A single-shot model call: you ask, the model answers, done.

An agent loop: the model decides what to do first, uses a tool, reads the result, decides what to do next, uses another tool, reads that result, and keeps going until the task is done or it gets stuck.

Picture an employee given a task and a laptop. They check email, look up a customer in the CRM, draft a reply, send it, log the interaction in the spreadsheet, and move to the next item. Each step is a decision. Each decision depends on what came before.

An agent does this with the model as the deciding brain, and tools (search, CRM, email send, file write) as the hands.
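The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the `decide` function is a hardcoded stand-in for the model call, and the tool names (`find_overdue_invoices`, `draft_reminder`) and invoice IDs are hypothetical.

```python
from typing import Callable

# One completed step: (tool name, argument, result the agent read back).
Step = tuple[str, str, str]

# Hypothetical tools. In a real system these would call a billing system, an email API, etc.
def find_overdue_invoices(_: str) -> str:
    return "INV-1,INV-2"

def draft_reminder(invoice_id: str) -> str:
    return f"Reminder drafted for {invoice_id}"

TOOLS: dict[str, Callable[[str], str]] = {
    "find_overdue_invoices": find_overdue_invoices,
    "draft_reminder": draft_reminder,
}

def decide(history: list[Step]) -> tuple[str, str]:
    """Stand-in for the model: choose the next (tool, argument) from what happened so far."""
    if not history:
        return ("find_overdue_invoices", "")
    invoice_ids = history[0][2].split(",")
    drafted = {arg for tool, arg, _ in history if tool == "draft_reminder"}
    for invoice_id in invoice_ids:
        if invoice_id not in drafted:
            return ("draft_reminder", invoice_id)
    return ("done", "")

def run_agent(max_steps: int = 8) -> list[Step]:
    """Decide, act, read the result, repeat until done or the step budget runs out."""
    history: list[Step] = []
    for _ in range(max_steps):
        tool, arg = decide(history)
        if tool == "done":
            break
        result = TOOLS[tool](arg)            # act
        history.append((tool, arg, result))  # read: the result feeds the next decision
    return history

if __name__ == "__main__":
    for tool, arg, result in run_agent():
        print(tool, arg, "->", result)
```

The `max_steps` cap is an assumption added here so the loop always ends, even if the deciding function never says it is done.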

Example task for watching an agent work (decide, act, read, decide): pull invoices unpaid for more than 30 days and draft reminder emails.

What agents are actually doing today

Simple agents. Booking a meeting that requires checking three calendars. Pulling data from three systems and assembling a report. Answering customer questions by querying the database, calling an API, and writing the reply. Tightly scoped, predictable steps, well-defined success.

Coding agents. Cursor, GitHub Copilot, Claude Code. These read your code, plan changes, edit files, run tests, fix what’s broken, repeat. For routine code tasks, they often outperform asking a model for a snippet — because they can read the surrounding code and check their own work.

Research agents. “Research our top three competitors and give me a brief on their pricing strategy.” The agent searches, reads, takes notes, searches more, eventually writes the brief. Quality varies enormously by topic.

Workflow agents. Sales follow-up automations, support triage, internal request routing. These tend to use models alongside traditional logic — model for the judgment calls, hardcoded rules for the predictable parts.
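One way to picture the split between hardcoded rules and model judgment is a router that tries the predictable rules first and only falls through to the model for what is left. The sketch below assumes a support-triage setting; the ticket fields, queue names, and `classify_with_model` are all hypothetical, and the "model" is a trivial keyword check so the example runs on its own.

```python
def classify_with_model(text: str) -> str:
    """Hypothetical stand-in for a model call that makes the judgment call."""
    return "billing" if "invoice" in text.lower() else "general"

def route(ticket: dict) -> str:
    # Predictable parts: plain rules, no model involved.
    if ticket.get("is_vip"):
        return "priority_queue"
    if ticket.get("subject", "").lower().startswith("password reset"):
        return "self_service"
    # Everything else needs judgment, so hand it to the model.
    return classify_with_model(ticket.get("body", ""))

if __name__ == "__main__":
    print(route({"subject": "Password reset please", "body": ""}))
    print(route({"subject": "Question", "body": "My invoice looks wrong"}))
```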

Where it works, where it fails

Works well when:

The task has clear steps, or a small set of choices at each step.
Tools are reliable — the calendar API actually works, the database returns clean data.
“Approximately right” is okay — a research brief, a first-draft response.
The agent runs for minutes, not hours.

Fails when:

The path forks unpredictably and the agent gets lost.
Tools are flaky — one bad API call cascades into bad downstream decisions.
Mistakes compound: each step’s small error carries into the next, so reliability drops as the chain of steps gets longer.
The agent has unbounded latitude (“organize my company’s operations”).
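A rough back-of-the-envelope calculation shows why compounding mistakes matter. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the 95% figure is an illustrative number, not a measurement.

```python
def chance_all_steps_succeed(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent, equally reliable steps."""
    return per_step_success ** steps

if __name__ == "__main__":
    for steps in (5, 10, 20):
        p = chance_all_steps_succeed(0.95, steps)
        print(f"{steps} steps at 95% each: {p:.2f}")
```

Under those assumptions, a 20-step run at 95% per step finishes cleanly only about 36% of the time, which is one reason short, tightly scoped runs hold up better than long ones.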

The honest status: simple and moderately complex agents are working. Ambitious autonomous agents are still flaky. Most production “agents” actually have humans in the loop at key checkpoints.

The cost of letting a model act

When a model only writes text, mistakes are cheap. You read, you discard, you try again.

When a model acts — sends emails, books appointments, updates databases, spends money — mistakes are expensive. A real customer gets a confused email. A real meeting goes on a real calendar. A real query writes real data.

Good agent systems put guardrails at the right points:

Approval steps for irreversible actions (sending, paying, deleting).
Logs of every step taken, so you can audit what happened and undo it where the action is reversible.
Bounded scope — it can email customers, but not random people; it can update certain records, not all of them.
Fallback to human when uncertain.
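The four guardrails above can be sketched as a thin wrapper that every action passes through before it runs. This is an illustration under stated assumptions: the action names, the recipient allowlist, the 0.7 threshold, and the `confidence` score are all hypothetical (real models do not necessarily expose a calibrated confidence), and the scope check covers only email recipients.

```python
from dataclasses import dataclass, field
from typing import Callable

IRREVERSIBLE = {"send_email", "pay", "delete"}   # actions that need sign-off
ALLOWED_RECIPIENTS = {"alice@customer.example"}  # bounded scope for email
MIN_CONFIDENCE = 0.7                             # below this, hand off to a person

@dataclass
class GuardedAgent:
    approve: Callable[[str, str], bool]          # human approval hook (hypothetical)
    log: list[str] = field(default_factory=list)

    def act(self, action: str, target: str, confidence: float) -> str:
        if confidence < MIN_CONFIDENCE:
            outcome = "escalated_to_human"       # fallback when uncertain
        elif action == "send_email" and target not in ALLOWED_RECIPIENTS:
            outcome = "blocked_out_of_scope"     # bounded scope
        elif action in IRREVERSIBLE and not self.approve(action, target):
            outcome = "rejected_by_approver"     # approval step
        else:
            outcome = "executed"                 # a real system would call the tool here
        self.log.append(f"{action} -> {target}: {outcome}")  # every step is logged
        return outcome

if __name__ == "__main__":
    agent = GuardedAgent(approve=lambda action, target: False)
    agent.act("update_record", "crm:123", 0.9)
    agent.act("send_email", "alice@customer.example", 0.9)
    print("\n".join(agent.log))
```

The checks run before the action, so a blocked or escalated step never touches the outside world, and the log records the outcome either way.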

Without these, an agent is a loaded gun pointed at a foot. With them, it’s a force multiplier.

The honest current state

Multi-step tasks that involve checking a few systems, reasoning over what comes back, and producing an output work today, in production at real companies. Coding agents are the most mature instance; well-scoped operations agents (booking, triage, internal lookups, customer responses drafted with tools) are the next tier.

Customer-facing or irreversible actions are where the technology is real but the engineering around it has to be careful. Most production deployments here keep a human checkpoint at the step that costs money or touches a real person. Removing that checkpoint is what tends to blow up publicly.

“Let an agent run our operations” — the demo-video framing — is not yet a real thing. The technology is real but immature. The wins right now are in scoped automation, not autonomy. That gap will close, but slowly and unevenly, and the difference between “looks like it works in a demo” and “works for six months without supervision” is much larger than it appears.