• Phrase Of The Week: "Distillation Attack"

    From Lawrence D'Oliveiro@3:633/10 to All on Tuesday, April 28, 2026 03:00:20
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems; now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack". Some AI service providers try to say this is
    against their terms and conditions, but how do you define what is and
    isn't allowed, exactly?

    <https://arstechnica.com/ai/2026/02/attackers-prompted-gemini-over-100000-times-while-trying-to-clone-it-google-says/>:

    In March 2023, shortly after Meta's LLaMA model weights leaked
    online, Stanford University researchers built a model called
    Alpaca by fine-tuning LLaMA on 52,000 outputs generated by
    OpenAI's GPT-3.5. The total cost was about $600. The result
    behaved so much like ChatGPT that it raised immediate questions
    about whether any AI model's capabilities could be protected once
    it was accessible through an API.
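    The Alpaca recipe above boils down to two steps: harvest lots of
    (prompt, output) pairs from the teacher's API, then fit a student on
    them. A toy sketch of that idea (the teacher here is just a lookup
    table standing in for a hosted LLM; the "training" is trivial
    memorisation rather than fine-tuning, names are mine):

    ```python
    # Toy illustration of distillation: training a student purely on a
    # teacher's input/output pairs, never touching the teacher's weights.
    from collections import defaultdict

    def teacher(prompt: str) -> str:
        # Hypothetical black-box model: an attacker only sees its outputs.
        canned = {
            "capital of france": "The capital of France is Paris.",
            "capital of japan": "The capital of Japan is Tokyo.",
        }
        return canned.get(prompt.lower(), "I don't know.")

    def harvest(prompts):
        # Step 1: issue many queries and record the answers.
        return [(p, teacher(p)) for p in prompts]

    def train_student(pairs):
        # Step 2: fit a student on the harvested pairs. Real distillation
        # fine-tunes a neural net on them; memorisation suffices here.
        student = defaultdict(lambda: "I don't know.")
        for prompt, answer in pairs:
            student[prompt.lower()] = answer
        return student

    pairs = harvest(["Capital of France", "Capital of Japan"])
    student = train_student(pairs)
    print(student["capital of france"])  # mimics the teacher without its weights
    ```

    The point of the sketch is that nothing in it requires access to the
    teacher's internals, only to its public interface.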

    A big worry in the US now is that China is doing this to train its AI
    models <https://arstechnica.com/tech-policy/2026/04/us-accuses-china-of-industrial-scale-ai-theft-china-says-its-slander/>.
    And of course, as soon as it is seen as a "national security" issue,
    they start to lobby the politicians to pass laws against it. Imagine
    that: making too much use of the service, or prompting it according to
    certain forbidden patterns, could now become criminal violations of
    laws such as the Economic Espionage Act and the Computer Fraud and
    Abuse Act.

    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.
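    The rate-limiting the article mentions is typically something like a
    per-client token bucket. A minimal sketch (class and parameter names
    are mine, not anything Google has published):

    ```python
    import time

    class TokenBucket:
        """Minimal per-client rate limiter: allow `rate` requests per
        second, with bursts of up to `capacity` requests."""
        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    bucket = TokenBucket(rate=1.0, capacity=5.0)
    results = [bucket.allow() for _ in range(10)]  # 10 back-to-back calls
    print(results.count(True))  # only the burst capacity gets through
    ```

    This is exactly why the article hedges with "helps": a patient
    attacker just spreads the same 100,000 queries over more time or more
    API keys.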

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Richard Kettlewell@3:633/10 to All on Tuesday, April 28, 2026 08:41:22
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.

    Google, OpenAI and friends are rolling in money, no doubt they can look
    after themselves. But there?s a more general question there which is how
    to solve the problem of distributed attack in general including spam,
    port scanning, DDoS, etc etc etc, and in fact also including large-scale
    web scraping to feed LLMs.

    Anecdotally at least, a big part of the problem currently is
    "residential proxy networks", i.e. hijacking of domestic Internet
    connections, which allows attackers to use huge numbers of independent
    IP addresses. If you block 1000 addresses today, another 1000 will be
    along tomorrow.

    A solution to that would be a service to humanity.

    --
    https://www.greenend.org.uk/rjk/

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Nuno Silva@3:633/10 to All on Tuesday, April 28, 2026 09:45:12
    On 2026-04-28, Lawrence D'Oliveiro wrote:

    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems; now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack". Some AI service providers try to say this is
    against their terms and conditions, but how do you define what is and
    isn't allowed, exactly?

    First, it's of course allowed legally, on the same grounds they used to
    build it; in some countries it might even be illegal to impose
    restrictions on reusing information like that.

    Second, well, this is clearly a good test for the whole thing: either
    there is fairness and they're told "no can do", that they must handle
    the so-called "attack" themselves with no legal remedy, or they're told
    they also can't do DDoS attacks on the open web.

    Sadly, I think we know where the USA will land on this...


    <https://arstechnica.com/ai/2026/02/attackers-prompted-gemini-over-100000-times-while-trying-to-clone-it-google-says/>:

    In March 2023, shortly after Meta's LLaMA model weights leaked
    online, Stanford University researchers built a model called
    Alpaca by fine-tuning LLaMA on 52,000 outputs generated by
    OpenAI's GPT-3.5. The total cost was about $600. The result
    behaved so much like ChatGPT that it raised immediate questions
    about whether any AI model's capabilities could be protected once
    it was accessible through an API.

    It is all public domain. Maybe it'll take time for them to realize that
    that doesn't only mean they can get away with their derivative works...

    A big worry in the US now is that China is doing this to train its AI
    models <https://arstechnica.com/tech-policy/2026/04/us-accuses-china-of-industrial-scale-ai-theft-china-says-its-slander/>.
    And of course, as soon as it is seen as a "national security" issue,
    they start to lobby the politicians to pass laws against it. Imagine
    that: making too much use of the service, or prompting it according to
    certain forbidden patterns, could now become criminal violations of
    laws such as the Economic Espionage Act and the Computer Fraud and
    Abuse Act.

    But of course, this is yet another way in which to destroy access to
    general computing. It wasn't enough to ruin web search engines or
    websites in general; they do need to make it a *crime* to actually try
    to use a computer or the internet for anything other than a small list
    of proprietary commercial services.

    What, you, YOU THERE, are you using IRC and netnews? CATCH HIM!

    And Helen Lovejoy will lend them a hand, for sure.


    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.

    Again, I'm not sure Google is understanding the ramifications of it
    being public domain. Which, given how they (don't) handle copyfraud in
    Google Books, isn't really surprising to me.

    --
    Nuno Silva

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Computer Nerd Kev@3:633/10 to All on Wednesday, April 29, 2026 08:26:16
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems, now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack".

    Great! Let the LLMs scrape each other instead of each one scraping
    my websites every minute. That'd be a more sane way for it to work,
    except all the investors hoping to buy into a future AI monopoly
    would run for the hills.

    --
    __ __
    #_ < |\| |< _#

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Kerr-Mudd, John@3:633/10 to All on Saturday, May 02, 2026 13:06:13
    On 29 Apr 2026 08:26:16 +1000
    not@telling.you.invalid (Computer Nerd Kev) wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems, now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack".

    Great! Let the LLMs scrape each other instead of each one scraping
    my websites every minute. That'd be a more sane way for it to work,
    except all the investors hoping to buy into a future AI monopoly
    would run for the hills.

    No chance; they'll scrape any and everything to the bone.

    --
    Bah, and indeed, Humbug

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)