IP, copyright, and provenance

Three different questions about ownership hide inside the phrase “we use AI for that.”

Can the output be owned? What was the model trained on? What is the company allowed to publish or sell?

The legal answer to all three is still settling. Several active lawsuits. New jurisdictional rulings every quarter. The picture will shift as cases land. The substantive questions themselves, though, are stable, and the answers depend on which AI surface produced the output more than people usually realise.

What follows is the rough current consensus among reasonable in-house counsel — directional, not legal advice.

On the first question, whether the output can be owned, the answer in most jurisdictions is some version of: not by itself.

The US Copyright Office has issued guidance that works generated purely by an AI cannot be registered for copyright. There has to be a human author, and the human’s contribution has to be something the office considers creative — selection, arrangement, modification. The prompt itself, in most rulings so far, does not count.

The UK, EU, and Canada are roughly aligned, with differing nuances. India and China have taken slightly different positions in specific cases. The picture is converging on: human contribution matters, and “human contribution” is being interpreted narrowly.

In practice:

  • A draft generated by AI and lightly edited is in a grey zone.
  • A draft generated by AI and substantially shaped, restructured, and supplemented by a person is typically copyrightable, at least for the human-added parts.
  • Pure one-shot AI output — “make me a logo” — is often unprotectable.

This matters most for the assets a company would sue over if a competitor copied them: brand identities, signature creative work, software with significant AI-written portions. For internal documents, briefs, and most working drafts, it is largely academic.

Worth noting: this is the same legal question regardless of where the AI came from. Output from a consumer chatbot, an internal tool built on an API, a self-hosted model, or a vendor product — all four sit on the same copyright line.

Question 2: What was the model trained on?

Large language and image models were trained on enormous quantities of internet-scale text and images. Much of that material was copyrighted. Most creators did not opt in.

Multiple lawsuits are pending. Authors Guild v. OpenAI. NYT v. OpenAI. Getty v. Stability AI. Concord Music v. Anthropic. Different jurisdictions, different theories, different stages. None has produced a decisive final ruling yet.

The defendants argue that training is fair use: it is transformative and does not reproduce the originals. The plaintiffs argue otherwise. The honest current state: no one can confidently say which way courts will land.

What this means in practice:

  • Training-data lineage is unresolved. Treat it as such.
  • Outputs are not “clean” in any guaranteed sense. They are usually fine. Occasionally they reproduce protected material recognisably.
  • The risk is small but non-zero, and the size of the risk depends on what gets done with the output.

This question also looks slightly different across surfaces. Frontier closed-model providers (OpenAI, Anthropic, Google) have not disclosed their training data in detail and are the central defendants in most lawsuits. Open-weights models from large labs (Llama, Mistral, Qwen) face similar questions, because their training data was scraped from similar sources. A self-hosted model does not magically inherit cleaner provenance just because it runs on company hardware; it inherits whatever the people who trained it pulled from. A handful of models trained only on licensed or public-domain data do exist, but they are smaller and lag the frontier substantially.

Question 3: What is the company allowed to publish or sell?

This is the question that lands most directly on day-to-day work, because it touches everything customer-facing.

Distinctive image, video, and audio outputs. Image, video, and music models can occasionally produce outputs that recognisably resemble a specific copyrighted character, a real artist’s signature style, a trademarked brand. The model has seen all those things in training. Most of the time it generalises; sometimes it reproduces.

The risk scales with two things — how much the output is published externally, and how distinctive the output is. An internal mockup is low-stakes. An ad campaign with national distribution is not.

Code from code AI. Code copilots can occasionally produce verbatim copies of training data. For most short snippets this is irrelevant, since function signatures are not creative work. For longer chunks of distinctive logic, it matters more. GitHub Copilot has a duplicate-detection filter that can be turned on; some other code tools offer something similar. A self-hosted code model has whatever provenance signals its trainers built in, which is often less than the major vendors offer.

Text. Text in customer-facing work is the lowest-risk modality from a copyright standpoint. Models generally synthesise across many sources rather than reproduce any one. The bigger risk for text is factual: claims that are wrong, attributions that don’t exist, citations that are invented. That is not a copyright question, but it is the more common production failure with AI text. Covered in Building trust.

Some vendors offer IP indemnification on their enterprise products. OpenAI, Anthropic, Microsoft, Google, Adobe, and IBM are among them, though the specifics differ.

The shape is roughly: if a customer is sued because of IP infringement in the AI’s output, the vendor will defend and cover damages, subject to conditions. The conditions typically include some combination of:

  • Must be on an enterprise plan, not consumer or free.
  • Must have certain safety and provenance features enabled (e.g. duplicate-detection on, content filters on).
  • Indemnity covers outputs, not training data — i.e. there is protection if the output infringes, not protection from being party to the underlying training-data lawsuits.
  • Excluded if the user knowingly prompted for infringing material (“make me an image of Mickey Mouse”) or significantly modified the output post-hoc.
  • Capped at certain dollar amounts or scaled to subscription tier.
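
As a rough illustration only, the conditions above can be written down as a pre-publication checklist. This is a sketch, not a statement of any vendor's actual terms: the `OutputContext` type, its field names, and the `indemnity_blockers` function are all hypothetical, and the real answer always sits in the contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputContext:
    """Facts about how one AI output was produced (hypothetical field names)."""
    enterprise_plan: bool                # enterprise tier, not consumer or free
    safety_features_enabled: bool        # e.g. duplicate detection, content filters
    knowingly_prompted_infringing: bool  # e.g. asked for a named protected character
    significantly_modified: bool         # output heavily altered after generation


def indemnity_blockers(ctx: OutputContext) -> list[str]:
    """Return the typical reasons a vendor indemnity might not apply.

    An empty list means none of the common exclusions were hit. It does not
    mean coverage is guaranteed, and it says nothing about caps.
    """
    blockers: list[str] = []
    if not ctx.enterprise_plan:
        blockers.append("not on an enterprise plan")
    if not ctx.safety_features_enabled:
        blockers.append("required safety or provenance features were off")
    if ctx.knowingly_prompted_infringing:
        blockers.append("prompt knowingly asked for infringing material")
    if ctx.significantly_modified:
        blockers.append("output was significantly modified after generation")
    return blockers


# Example: an enterprise user who left the duplicate-detection filter off.
ctx = OutputContext(
    enterprise_plan=True,
    safety_features_enabled=False,
    knowingly_prompted_infringing=False,
    significantly_modified=False,
)
for reason in indemnity_blockers(ctx):
    print("Possible exclusion:", reason)
```

The value of a checklist like this is less the code than the forcing function: each field is a question someone has to be able to answer before a distinctive asset goes out.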

The uneven part: indemnity exists in the enterprise-vendor lane and largely nowhere else. A model output produced by an employee in a consumer chatbot comes with no indemnity. A model output produced by an internal tool your team built on an API may come with API-tier indemnity from the underlying provider, or may not, depending on what the developer agreement says. A model output produced by a self-hosted open-weights model comes with no indemnity from anyone — the legal exposure stays entirely with the business running the model.

That asymmetry is worth holding in mind. The same image, produced two different ways, has very different legal-exposure profiles depending on which surface produced it.

Provenance — the cheapest control that is most often skipped

The single highest-leverage move for the IP picture is keeping track of what produced what.

For any significant AI-touched asset — a published image, a marketing campaign, a brand identity, a software module — a simple record:

  • Which model.
  • Which prompt or prompts.
  • Which human edits.
  • Which version went out.

This is provenance. It does several things at once. It strengthens the human-contribution claim for copyright purposes. It makes the indemnity conversation faster if a question ever comes up. It lets a team reconstruct months later how something was made. And — increasingly — it is becoming a customer expectation in regulated and creative industries.

Provenance is cheap. It is mostly a metadata layer on top of whatever assets are already being stored. The engineering cost is small. The optionality is large. Most teams skip it because nobody asked for it; the ones that build the habit early are noticeably more comfortable when a question does arrive.
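
To make the "metadata layer" concrete, here is a minimal sketch of what such a record could look like, stored as a JSON sidecar next to the asset. Everything here is an assumption for illustration: the `ProvenanceRecord` type, its field names, the example values, and the choice of a SHA-256 fingerprint to tie the record to the released file are not a standard, just one plausible shape.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """One record per significant AI-touched asset (hypothetical schema)."""
    asset_id: str
    model: str               # which model
    prompts: list[str]       # which prompt or prompts
    human_edits: list[str]   # short notes on what a person changed
    released_version: str    # which version went out
    content_sha256: str      # fingerprint of the released file
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def fingerprint(content: bytes) -> str:
    """Hash the released bytes so the record can be matched to the file later."""
    return hashlib.sha256(content).hexdigest()


def to_sidecar_json(record: ProvenanceRecord) -> str:
    """Serialise the record for storage alongside the asset."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def has_human_contribution(record: ProvenanceRecord) -> bool:
    """True if at least one human edit was logged for this asset."""
    return len(record.human_edits) > 0


# Example with made-up values.
released_bytes = b"...final image bytes..."
record = ProvenanceRecord(
    asset_id="spring-campaign-hero",
    model="example-image-model",
    prompts=["hero image, spring palette, no people"],
    human_edits=["recomposed layout", "replaced background", "added logo"],
    released_version="v3",
    content_sha256=fingerprint(released_bytes),
)
print(to_sidecar_json(record))
```

The `human_edits` list is the part that does the most work: it is the evidence behind a human-contribution claim, and it is the first thing anyone will ask for if a question arrives months later.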

Across the questions above, there is a clean line and a messy line.

The clean line: AI-assisted work that is meaningfully shaped by a person, on a surface with strong vendor terms and indemnity, with provenance kept, going through a normal review process. That work is in roughly the same risk posture as any other creative or technical output a business publishes.

The messy line: AI output produced on an unaccountable surface (free consumer chatbot, self-hosted model with no provenance), with no human shaping, no provenance, no review, going straight to a customer-facing channel. Most incidents that make the news live on this line.

The space between the two is wide. The question is which side a specific use case lives on, and that question is answerable.

The IP picture is messy and will be messy for some time. None of that is a reason not to use AI, and very little of it changes the day-to-day reality of work that is mostly internal, mostly draft-level, mostly low-stakes.

It does change the day-to-day for the slice of work that is customer-facing, distinctive, and external. For that slice, the legal exposure depends heavily on which surface produced the asset and what provenance exists. The downside scenarios — a published asset too close to someone’s protected work, an ambiguous claim about who owns the brand identity an AI tool generated — come from skipping the visible steps, not from the AI itself.

The companies that handle this well do not have the longest legal review process. They have built a small number of habits into the default workflow — meaningful human involvement on anything customer-facing, vendor terms understood, provenance kept — and made those habits the path of least resistance.