Beyond text — image, audio, and video AI

Most of Module 1 was about text: tokens going in, tokens coming out. The interesting work now extends well past text, into image generation, voice agents, AI video, and models that read screenshots, documents, and charts. The useful surprise is that under the hood, it’s the same machine. Same loop. Different ingredients.

The pattern under all of it

A model is a pattern-matching machine that predicts the next chunk, conditioned on what came before and on the prompt.

For text, the chunks are tokens — roughly: words, sometimes word-pieces.

For images, the chunks are patches — small compressed representations of regions of the image. (Pixel-by-pixel would be too slow; modern models work at the patch level.)

For audio, the chunks are samples of sound at very small intervals, or compressed representations of them.

For video, the chunks are frames, or smaller pieces within frames stitched across time.
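
To make the image case concrete, here is a minimal sketch of cutting an image into patches. It is illustrative only: the function name split_into_patches and the toy 4x4 grid are assumptions for this example, and real models go further by compressing each patch into a learned representation rather than keeping raw pixel values.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid of pixel values into non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    Assumes the image height and width are exact multiples of patch_size.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Take patch_size rows, and from each row patch_size columns.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A toy 4x4 "image" whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)

print(len(patches))  # 4 patches of 2x2 pixels each
print(patches[0])    # [[0, 1], [4, 5]]
```

The model then treats this list of patches the way a text model treats a list of tokens.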

Same loop. Different ingredients. The text intuition from Module 1 — pattern-matching from enormous training data, conditioned on a prompt — transfers directly. Each modality has its own quirks at the edges, but the core machine is the same.

Image generation

Models like Stable Diffusion, DALL·E, Midjourney, Imagen, and Flux generate images from text prompts. Type “a coffee shop in monsoon Mumbai, soft afternoon light”; the model returns an image that matches.

Under the hood, the model has been trained on hundreds of millions of image-caption pairs and has learned the patterns linking text descriptions to visual structure. Given the prompt, it generates an image by progressively refining noise into a recognisable picture. The math (diffusion) differs from text generation, which emits one token after another, but the idea is the same in spirit: pattern-matching from training data, conditioned on a prompt.
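
The “refining noise” idea can be sketched in a few lines. This is a toy, not a diffusion model: in a real system a trained network estimates what noise to remove at each step, guided by the prompt, whereas here a known target simply stands in for that network. The names refine and max_error, the step fraction, and the five-number “picture” are all assumptions made for illustration.

```python
import random


def refine(noisy, target, steps, step_fraction=0.3):
    """Move a noisy signal a little closer to a target at every step.

    ASSUMPTION: the known target stands in for a trained denoising network.
    A real model does not have the answer; it predicts the correction.
    """
    current = list(noisy)
    history = [current]
    for _ in range(steps):
        current = [c + step_fraction * (t - c) for c, t in zip(current, target)]
        history.append(current)
    return history


def max_error(signal, target):
    """Largest absolute difference between the signal and the target."""
    return max(abs(s - t) for s, t in zip(signal, target))


random.seed(0)
target = [0.0, 0.5, 1.0, 0.5, 0.0]                  # the "picture" we want
noise = [random.gauss(0.0, 1.0) for _ in target]    # start from pure noise
history = refine(noise, target, steps=20)

# The error shrinks step by step as the noise is refined away.
for step in (0, 5, 10, 20):
    print(step, round(max_error(history[step], target), 4))
```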

What image generation is good at:

Marketing concepts, mood boards, brainstorm visuals
Illustrations, stock-photo replacements, decorative imagery
Product mockups (with caveats below)
Stylised art, design exploration

Where it struggles:

Specifics of a particular product or brand. The model has usually never seen them, or has seen too little of them to reproduce them faithfully. Same problem as text. Workarounds exist (LoRAs, fine-tunes, reference-image conditioning) but add work.
Faces. Small inaccuracies in human faces are perceptually obvious. Much better in the current generation of models; not solved.
Hands and limbs. Extra fingers and anatomical errors are the running joke for a reason.
Text inside images: signs, labels, captions, on-pack copy. Often garbled; improving, but still unreliable.
Consistency across regenerations. The same character in two generated images is rarely the same character without extra plumbing.

The honest summary: image generation is solid where “approximately right” is fine (concept, mood, illustration, exploration). It is brittle wherever a specific person, a specific product, or specific words on a label have to come out exactly. Disclosure rules for AI-generated imagery are also tightening in a number of major jurisdictions and are worth knowing about.

Voice and audio

“Voice AI” gets used as if it were one thing. It is at least three.

Speech-to-text (transcription). Audio in, text out. Mature. Reliable for major languages, decent for most accents, weaker on overlapping speakers and heavy background noise. Meeting notes, call transcripts, dictation. Mostly a solved problem in practice.

Text-to-speech (voice synthesis). Text in, audio out. Has moved from clearly robotic to near-indistinguishable from human in the current generation of models. Voice cloning — training a synthetic voice on a few minutes of someone’s speech — is also routine, with its own significant disclosure and consent concerns.

Voice agents. Audio in, audio out, with a model in the middle. The classic build is three pieces stitched together — speech-to-text, then an LLM that reasons, then text-to-speech. Newer end-to-end voice models do it as one model that takes audio in and produces audio out, which is faster and feels more natural in conversation.
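
A minimal sketch of the classic three-piece build. The three stages are stand-in functions invented for this example (no real speech or LLM service is called), and the VoiceAgent class is hypothetical; the point is only the shape of the pipeline, where each stage's output is the next stage's input.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """Hypothetical wrapper that chains the three classic stages."""
    transcribe: Callable[[bytes], str]   # speech-to-text
    respond: Callable[[str], str]        # LLM that reasons over text
    synthesize: Callable[[str], bytes]   # text-to-speech

    def handle_turn(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        text_out = self.respond(text_in)
        return self.synthesize(text_out)


# Stand-in stages: "audio" is just UTF-8 bytes so the sketch runs anywhere.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    if "balance" in text.lower():
        return "Your balance is available in the app."
    return "Let me connect you to a person."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(fake_transcribe, fake_respond, fake_synthesize)
reply = agent.handle_turn(b"What is my balance?")
print(reply.decode("utf-8"))
```

Because each stage waits for the previous one to finish, the delays add up, which is one reason a single end-to-end model can feel more responsive in conversation.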

Where voice agents currently work:

Narrow, well-defined domains: appointment booking, FAQ support, password resets.
Read-back-the-information tasks: account balances, order status, delivery windows.
Outbound calls with scripted-but-flexible flows: reminders, surveys, simple confirmations.

Where they still break:

Open-ended conversation across many topics.
Emotional cues — frustration, sarcasm, distress — that humans pick up instantly.
Multi-speaker situations: conference calls, families on a household line, anyone handing the phone around.
Strong accents, noisy environments, fast topic-switching.
Knowing when to escalate to a human, and doing it gracefully.

The picture today: transcription is production-ready. Voice synthesis is production-ready with appropriate disclosure. Voice agents are real in narrow domains; in open-ended customer-facing settings they work some of the time and frustrate the rest. The category is improving quickly enough that the line between “narrow” and “open-ended” is moving outward each release cycle.

Image understanding (not generation)

A separate capability that gets bundled under “vision AI” — a model that reads an image and produces text about it. Paste a screenshot into a chat, ask “what does this dashboard say?”, and get a useful answer back.

This is a different job from generation: the model reads the visual input and produces text about it. Current frontier models handle this well.

Useful for:

Extracting structured data from screenshots, photos of documents, scanned receipts and invoices.
Reading charts and graphs (with verification — small errors are common, especially on dense data).
Generating image descriptions for accessibility.
Identifying defects or anomalies in product photos.

For most everyday business work, image understanding is the workhorse vision capability. It’s more practical, more reliable, and shows up in more real workflows than image generation — but it gets less marketing attention.
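
The “with verification” caveat can often be automated for extracted documents. As a minimal sketch, assume the model has returned an invoice as a dictionary with line_items and total fields (a hypothetical shape chosen for this example, not any particular product's output). A simple arithmetic check then catches many misread digits before they reach a downstream system.

```python
from decimal import Decimal


def check_invoice(extracted):
    """Return a list of problems found in a model-extracted invoice.

    ASSUMPTION: amounts arrive as strings such as "19.99", and the
    dictionary has "line_items" and "total" keys.
    """
    problems = []
    items = extracted.get("line_items", [])
    if not items:
        problems.append("no line items extracted")

    # Decimal avoids the rounding surprises of binary floats with money.
    computed = sum((Decimal(item["amount"]) for item in items), Decimal("0"))
    stated = Decimal(extracted["total"])
    if computed != stated:
        problems.append(f"line items sum to {computed}, but total says {stated}")
    return problems


good = {"line_items": [{"amount": "19.99"}, {"amount": "5.01"}], "total": "25.00"}
bad = {"line_items": [{"amount": "19.99"}, {"amount": "5.01"}], "total": "26.00"}

print(check_invoice(good))  # []
print(check_invoice(bad))   # one problem reported
```

A check like this does not prove the extraction is right, but a mismatch is a cheap signal that a human should look at the original image.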

Video AI

Video is the youngest of the major modalities. Two distinct capabilities are emerging at very different maturity levels.

Video understanding. A model watches a clip and describes what’s in it, transcribes it, or summarises it. Working for short clips at reasonable quality, with the usual asterisks around long-form content and dense action.

Video generation. A model produces a new video from a text prompt and, increasingly, from an image or a reference clip. It has only recently crossed into usable territory, and quality jumps every few months. In the current generation of models, photorealistic short clips (from a few seconds up to about a minute) are achievable for many subjects; physical consistency across longer takes (water and other fluids, smoke, crowds, hands doing precise things) is still where the cracks show.

What it’s good for:

Short stylised marketing clips and social-first content.
Internal explainers, training content, demo loops.
Background motion, decorative animation.
Pre-visualisation and concept exploration before committing to a real production.

What it isn’t ready for:

Long-form photorealism.
Anything requiring lip-sync with a specific actor.
Anything where physics has to hold together across many seconds.
Production-quality work that competes head-on with traditional video.

The trajectory is steep. Use cases that don’t work now often work a release cycle later. Anything written about “what AI video can do” has a short shelf life.

Multimodal — one model, many media

Frontier models now handle multiple modalities at once. Paste an image into a chat and ask about it. Upload a PDF with text and embedded charts; the model reads both. Some models accept audio directly. A single underlying model bridges the modalities internally, with no separate extraction step needed.

This matters for any workflow where information naturally lives in multiple forms — a customer ticket with screenshots attached, a document mixing text and diagrams, an invoice that exists only as a scanned image, a meeting that’s part transcript and part shared screen. A multimodal model takes the whole input as one prompt instead of needing a pipeline to convert everything to text first.

The reverse is also coming online: models that produce mixed-media output. A response that interleaves text and a generated chart, a slide deck assembled from a prompt, a voice agent that hands off to a visual interface mid-call. The architectural story is still consolidating, but the direction is the same — one model, many surfaces in and out.

What carries across every modality

The limits from Module 1 transfer almost line-for-line.

The same hallucination problem. Visual hallucinations: a fabricated logo, a made-up label, a wrong face. Audio hallucinations: a transcribed word that wasn’t said. Same mechanism — the model is producing what fits the pattern — different surface.
The same context limits. A model has finite room for image, audio, or video input. Hour-long video, multi-channel audio, very high-resolution images all push against ceilings; chunking and summarisation become the workarounds, with the same trade-offs as for long text.
The same “approximately right” problem. A 90%-correct image is great for a mood board and useless for a logo. The acceptable margin depends entirely on the job.
The same “no business context” problem. The model has never seen the specific product, the specific customers, or the brand voice. It is generic until conditioned on the company’s data.
The same training-data dependence. The model is what it ate. Underrepresented domains — a regional accent, a niche industrial process, an unusual product — show quality drops the same way they do in text.
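
The chunking workaround mentioned under context limits looks much the same for any modality. A minimal sketch, assuming the input has already been turned into a list of items (video frames, audio segments, or transcript lines); the function name chunk_with_overlap and the sizes used are illustrative, not taken from any particular tool.

```python
def chunk_with_overlap(items, chunk_size, overlap):
    """Split a long sequence into windows that share `overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(items):
            break
    return chunks


frames = list(range(10))  # stand-in for ten video frames
chunks = chunk_with_overlap(frames, chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap keeps a little shared context between neighbouring chunks, at the cost of processing some items twice. Each chunk can then be summarised separately and the summaries combined, with the same loss of detail as when long text is summarised.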

The fundamentals from Module 1 are the same fundamentals here. Once the loop is intuitive in one modality, the others are variations on it — different chunks, different failure modes at the edges, the same machine underneath.