The big model families
A small number of labs ship most of the frontier models. Knowing the cast is half of reading any news cycle in this space.
The frontier labs
OpenAI — the GPT family. The flagship general-purpose models plus a parallel line of “reasoning” models that take more time and tokens to think through a problem before answering. ChatGPT runs on these. Probably the most-recognised AI brand in the world.
Anthropic — the Claude family. Three tiers ship alongside each other: Opus at the top, Sonnet as the workhorse, Haiku at the small/fast end. Long context windows, careful tone, strong on code and long-document work. Widely used in enterprise settings.
Google — the Gemini family. Pro at the top, Flash as the small/fast tier. Strong on multimodal — handling images, audio, and video alongside text. Wired through Google’s products: Gmail, Docs, Workspace, Search.
Meta — the Llama family. Open-weights: the model weights are publicly downloadable. Below the closed frontier in raw capability, but free to download under Meta’s licence terms, inspectable, and runnable on infrastructure the adopting company controls (more on what that means in the next chapter).
Mistral — a French lab. Ships both open-weights and hosted closed models. Smaller, faster, and notably present in European enterprise contexts where data residency matters.
DeepSeek — a Chinese lab. Open-weights. Has shipped models that perform well above their compute and pricing tier, and meaningfully narrowed the gap with the closed frontier.
A few others come up situationally (xAI’s Grok, Alibaba’s Qwen, Cohere, sometimes Apple’s on-device models), but the six above are the cast in most conversations inside a business.
Tiers within each family
Each lab ships at roughly three sizes. The names change every six to twelve months. The pattern doesn’t.
Frontier tier. The smartest, slowest, most expensive. The “thinks before answering” models live here. Used for hard reasoning, long-form work, anything where quality matters more than speed or cost per call.
Workhorse tier. Meaningfully cheaper, almost-as-good for most everyday tasks. This is what most production applications actually run on. The price-quality sweet spot.
Small / fast tier. Fast, cheap, less capable. Used for high-volume, low-difficulty tasks — classification, routing, simple extractions, anything that needs to run millions of times a day.
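One way to picture the three tiers is as a routing table: each task type goes to the cheapest tier that can handle it. The sketch below is only an illustration under assumptions. The task labels and the model identifiers (`example-frontier-model` and so on) are hypothetical placeholders, not real product names, and a real system would define its own.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"      # smartest, slowest, most expensive
    WORKHORSE = "workhorse"    # price-quality sweet spot
    SMALL_FAST = "small_fast"  # high-volume, low-difficulty work


# Hypothetical model identifiers. Keeping them in one table means a model
# rename or upgrade touches this mapping and nothing else.
MODEL_FOR_TIER = {
    Tier.FRONTIER: "example-frontier-model",
    Tier.WORKHORSE: "example-workhorse-model",
    Tier.SMALL_FAST: "example-small-model",
}

# Illustrative task labels based on the tier descriptions above.
SMALL_FAST_TASKS = {"classification", "routing", "simple_extraction"}
FRONTIER_TASKS = {"hard_reasoning", "long_form"}


def pick_tier(task: str) -> Tier:
    """Return the tier for a task label; unlisted tasks go to the workhorse."""
    if task in SMALL_FAST_TASKS:
        return Tier.SMALL_FAST
    if task in FRONTIER_TASKS:
        return Tier.FRONTIER
    return Tier.WORKHORSE


def pick_model(task: str) -> str:
    """Resolve a task label to a (hypothetical) model identifier."""
    return MODEL_FOR_TIER[pick_tier(task)]
```

Because model names change far more often than the tier pattern, code written against tiers tends to age better than code written against specific model names.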
The whole curve shifts down over time. The frontier model of the previous generation is roughly the small/fast tier of the next one, at a fraction of the price. The same drift is expected to continue. Any cost or capability assumption locked in today has a short half-life.
What “best at” actually means
Marketing pages and benchmark tables say things like “best at coding,” “most creative,” “top of the leaderboard.” A few things to hold in mind when reading them:
- The numbers change every few months as new models ship. Today’s leader is often tomorrow’s middle of the pack.
- Benchmarks are necessarily generic. They measure performance on standardised tasks that may or may not resemble the specific job the model would actually be put on.
- They don’t account for prompt quality, context, or the way the model is being used in production. The same model wired into a thoughtful workflow can outperform a “smarter” model wired badly.
- They don’t capture taste — the ways one model phrases things, handles ambiguity, hedges or commits, lays out structured output. For real work, taste is most of the experience.
For most everyday business tasks (drafting, summarising, extracting, classifying, answering questions over documents), the top three or four frontier models are interchangeable as far as the user would notice. The differences matter at the edges: extremely long context, hard reasoning, specialised code work, fluent multimodal handling, less-resourced languages, very strict instruction-following.
How “which is best” actually settles
In practice, the question rarely settles by benchmark. It settles by use — somebody running real work on a model for a few weeks, noticing where it shines and where it frustrates, and forming a taste. Different teams arrive at different defaults for that reason, and the same team’s default changes as new models ship.
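A team that wants to make this kind of trial slightly more deliberate could keep a simple log of which model's output a reviewer preferred on each real task, then tally it. The sketch below assumes such a log exists. The field names (`task`, `preferred`), the model labels (`model_a`, `model_b`), and the entries themselves are made up to show the shape of the data and are not real results.

```python
from collections import Counter, defaultdict


def tally_preferences(trials):
    """Count which model reviewers preferred, overall and per task type.

    Each trial is a dict with a "task" label and the "preferred" model name,
    recorded by a person who compared the outputs side by side.
    """
    overall = Counter()
    by_task = defaultdict(Counter)
    for trial in trials:
        overall[trial["preferred"]] += 1
        by_task[trial["task"]][trial["preferred"]] += 1
    return overall, dict(by_task)


# Made-up entries to show the shape of the log; not real results.
trials = [
    {"task": "summarising", "preferred": "model_a"},
    {"task": "summarising", "preferred": "model_b"},
    {"task": "extraction", "preferred": "model_a"},
    {"task": "extraction", "preferred": "model_a"},
]

overall, by_task = tally_preferences(trials)
print(overall.most_common())
for task, counts in by_task.items():
    print(task, dict(counts))
```

The per-task breakdown is the useful part: it can show one model winning on extraction while another holds its own on summarising, which is the kind of taste a single leaderboard number hides.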
A reasonable mental anchor: the top closed frontier models (the current generation of GPT, Claude, Gemini) are the safe default for general work; specialised tasks (very long documents, code, multimodal, very high volume) are where it can be worth looking at workhorse or open-weights alternatives. Past that, it’s a question of which model the team has actually spent time with, which is genuinely a better guide than any leaderboard.