AI Distillation War: What It Means for Your Data

By Riz Pabani on 24-Feb-2026

They Robbed the Robbers: What the AI Distillation War Means for Your Data

Anthropic just accused three Chinese AI labs of using 24,000 fake accounts to reverse-engineer Claude.

DeepSeek, Moonshot, and MiniMax collectively sent over 16 million prompts to Claude's API. Not casual questions. Carefully structured queries designed to extract how Claude reasons, writes code, and uses tools. Then they fed those responses into their own models.

Anthropic calls this "distillation." In plain English: ask the smart model enough of the right questions and your model learns to copy it. MiniMax alone accounted for 13 million of those exchanges. When Anthropic released a new Claude model mid-campaign, MiniMax redirected half its traffic to the new version within 24 hours. They weren't experimenting. They were running an industrial extraction operation.

This is a real story with real national security implications. I'm not dismissing it.

But I do want to talk about the irony. Because nobody else seems to want to.

How These Models Were Built in the First Place

Anthropic built Claude by training it on text written by other people. A lot of text. They downloaded over 7 million books from pirate websites to use as training data. Last September, they settled with the authors for $1.5 billion. The largest copyright settlement in US history.

OpenAI is facing similar lawsuits. Google scraped the open web at a scale most people can't picture. Every major AI lab did some version of this. The models exist because they learned the statistical relationships between trillions of words that other people wrote.

These are autocomplete machines. But the raw material was always other people's work.

So when Anthropic says "they stole our reasoning," yes, the terms of service were violated, fake accounts were created, and the security implications are serious. But the underlying pattern is the same one that built the models in the first place: take someone else's output and use it to train your system.

Reddit's summary was three words: "They robbed the robbers."

Now Look at What's Happening on Your Desktop

Here's where this gets personal.

Anthropic recently launched Cowork, a desktop AI agent that works in your files, your browser, and your dashboards. I use it daily for content planning and SEO research. It reads my Google Search Console data through the browser and builds out my content schedule in Linear. Useful stuff.

But look at the plugin categories that shipped alongside it: marketing, sales, competitive intelligence, account research, call prep, performance analytics, content creation, brand voice. Those are just the ones from Anthropic. Third-party plugins already cover wealth management, investment analysis, and legal research.

Read that list again. Those are job descriptions. The kind that pay six figures in the City and Canary Wharf.

These agents aren't just helping you do your work. They're sitting inside your working day, watching how you do it. Every prompt you write. Every workflow you build. Every spreadsheet you ask it to analyse. Every client name you mention in a brief. That data flows through someone's infrastructure.

Maybe it's not being used for training today. But the pipes are there. The logging is there. And the terms of service give the labs more room than most people realise.

The Distillation Problem Is Your Problem Too

The distillation attacks on Claude worked because the model's outputs contain its reasoning. You can reverse-engineer how it thinks by studying what it produces. That's exactly what DeepSeek, Moonshot, and MiniMax did, at industrial scale.

Now think about your own AI usage. Every time you use a closed-model agent to process client data or draft strategy documents, you're sending the patterns of your professional work through someone else's system. Your prompts describe how you think about problems. Your tool calls reveal your workflows.

If the biggest AI labs in the world, with dedicated security teams and detection classifiers, couldn't stop each other from extracting their proprietary capabilities, what's protecting your firm's workflows?

This comes up constantly in my AI coaching sessions. People want to use the tools but they're worried, reasonably, about where their data goes. I explain that you don't have to send everything to OpenAI or Anthropic's servers. There are open-source models under 1 GB that run entirely on your CPU. No internet connection needed. And if you want something closer to state-of-the-art, a couple of Mac Studios can run a full open-source model locally. The capability is there. Most people just don't know it exists.

The Case for Running Things Locally

This is where open-source models start to matter. Not for ideological reasons. For practical ones.

Tools like Ollama and LM Studio let you run language models on your own hardware. Nothing leaves your machine. No API calls. No logging on someone else's server.
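To make "nothing leaves your machine" concrete, here's a minimal sketch of talking to a local Ollama server from Python. It assumes Ollama is installed and running on its default port (11434); the model name is illustrative, and you'd swap in whatever model you've pulled locally.

```python
import json
import urllib.request

# Ollama's default local endpoint; the request never leaves your machine
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    # stream=False asks Ollama for one complete JSON response
    # instead of a stream of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def local_generate(prompt: str, model: str = "llama3.2") -> str:
    """Send a prompt to a local Ollama server and return its reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

No API key, no third-party logging: the only server involved is the one on localhost.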

You lose some capability. Claude is still better at writing. GPT-4o is still better at certain reasoning tasks. But for anything involving client data or regulated industries, the trade-off looks very different.

I keep an updated list of the best LLMs in 2026, including the open-source ones (DeepSeek R1, LLaMA 4, Mistral Large 3, Qwen 3) with context windows and use cases. Worth a look if you're evaluating your options.

The hybrid approach makes sense to me. Use closed models for general research and content where the data isn't sensitive. Use open-source models for anything you wouldn't want a third party reading. Same way you'd use Google Docs for a marketing brief but wouldn't put your cap table in there.
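The split can even be automated. A toy sketch of a router that sends anything sensitive-looking to a local model and everything else to a cloud API; the keyword patterns are purely illustrative, and a real deployment would use proper PII-detection tooling rather than a hand-rolled list.

```python
import re

# Illustrative markers of sensitive content. In practice you'd use a
# dedicated PII/classification library, not a keyword list like this.
SENSITIVE_PATTERNS = [
    r"\bclient\b",
    r"\bcap table\b",
    r"\bsalary\b",
    r"\bcontract\b",
]

def route(prompt: str) -> str:
    """Return 'local' for sensitive-looking prompts, else 'cloud'."""
    lowered = prompt.lower()
    if any(re.search(p, lowered) for p in SENSITIVE_PATTERNS):
        return "local"
    return "cloud"
```

The point isn't the crude matching; it's that the decision about where a prompt goes can be made in code, before anything touches a network.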

Questions Worth Asking at Your Next Team Meeting

If you're using AI tools in your organisation, especially agents with access to files, browsers, and internal systems, here are the questions I'd be raising:

Do you know which of your AI tools log your prompts? Most do. Some retain them for training unless you explicitly opt out. Check the data retention policy, not the marketing page.

Are you rotating API keys and logging tool calls? If you're building automations on top of AI APIs, treat them like any other piece of production infrastructure. Audit trails matter. If you're running a self-hosted AI setup, the same principle applies.

Has anyone assessed open-source alternatives for your sensitive workflows? Running a local model for client-facing work while using Claude or GPT for internal research is one way to split it.

Who in your organisation is actually reading the terms of service? Not the summary. The actual document. Specifically the sections on data usage, model training, and third-party access.
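The audit-trail point above is cheap to prototype. A minimal sketch of wrapping agent tool calls in a logging decorator so every invocation leaves a record; the tool name `search_crm` and the in-memory log are hypothetical stand-ins for your own tools and an append-only store.

```python
import functools
import json
import time

# In production this would be an append-only store, not a Python list
AUDIT_LOG: list = []

def audited(tool_name: str):
    """Decorator that records every tool call with a timestamp and arguments."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            AUDIT_LOG.append({
                "tool": tool_name,
                "ts": time.time(),
                "args": json.dumps([args, kwargs], default=str),
            })
            return fn(*args, **kwargs)
        return inner
    return wrap

@audited("search_crm")
def search_crm(query: str) -> str:
    # Hypothetical tool; stands in for whatever your agent can call
    return f"results for {query}"
```

Once every tool call is logged, "what did the agent actually touch?" becomes a query instead of a guess.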

The Bigger Picture

The AI distillation war between labs is a good story. Espionage, trade-secret drama, fake accounts, geopolitics. It'll get a lot of attention this week.

But the story underneath it is quieter and closer to home. AI agents are moving from "tools you use" to "systems that observe how you work." The data flowing through these systems, your prompts, your workflows, your professional patterns, is valuable. The labs know it. The Chinese labs proved it by spending months extracting it from each other.

The question is whether you're treating your own AI usage with the same seriousness.

This is one of the things I cover in my AI training sessions. How to use these tools without giving away the shop. If that's relevant to your team, drop me a message.


Riz Pabani is an AI trainer based in London, offering 1:1 and group AI training sessions for individuals and businesses worldwide. Learn more about Riz.

Related Articles