• Phrase Of The Week: "Distillation Attack"

    From Lawrence D'Oliveiro@3:633/10 to All on Tuesday, April 28, 2026 03:00:20
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems; now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack". Some AI service providers try to say this is
    against their terms and conditions, but how do you define what is and
    isn't allowed, exactly?

    <https://arstechnica.com/ai/2026/02/attackers-prompted-gemini-over-100000-times-while-trying-to-clone-it-google-says/>:

    In March 2023, shortly after Meta's LLaMA model weights leaked
    online, Stanford University researchers built a model called
    Alpaca by fine-tuning LLaMA on 52,000 outputs generated by
    OpenAI's GPT-3.5. The total cost was about $600. The result
    behaved so much like ChatGPT that it raised immediate questions
    about whether any AI model's capabilities could be protected once
    it was accessible through an API.
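    The Alpaca recipe above boils down to two steps: harvest lots of
    (prompt, output) pairs from the teacher's API, then fit a student on
    them. A toy sketch of that idea (the teacher here is just a lookup
    table standing in for a hosted LLM; the "training" is trivial
    memorisation rather than fine-tuning, names are mine):

    ```python
    # Toy illustration of distillation: training a student purely on a
    # teacher's input/output pairs, never touching the teacher's weights.
    from collections import defaultdict

    def teacher(prompt: str) -> str:
        # Hypothetical black-box model: an attacker only sees its outputs.
        canned = {
            "capital of france": "The capital of France is Paris.",
            "capital of japan": "The capital of Japan is Tokyo.",
        }
        return canned.get(prompt.lower(), "I don't know.")

    def harvest(prompts):
        # Step 1: issue many queries and record the answers.
        return [(p, teacher(p)) for p in prompts]

    def train_student(pairs):
        # Step 2: fit a student on the harvested pairs. Real distillation
        # fine-tunes a neural net on them; memorisation suffices here.
        student = defaultdict(lambda: "I don't know.")
        for prompt, answer in pairs:
            student[prompt.lower()] = answer
        return student

    pairs = harvest(["Capital of France", "Capital of Japan"])
    student = train_student(pairs)
    print(student["capital of france"])  # mimics the teacher without its weights
    ```

    The point of the sketch is that nothing in it requires access to the
    teacher's internals, only to its public interface.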

    A big worry in the US now is that China is doing this to train its AI
    models <https://arstechnica.com/tech-policy/2026/04/us-accuses-china-of-industrial-scale-ai-theft-china-says-its-slander/>.
    And of course, as soon as it is seen as a "national security" issue,
    they start to lobby the politicians to pass laws against it. Imagine
    that: making too much use of the service, or prompting it according to
    certain forbidden patterns, could now become criminal violations of
    laws such as the Economic Espionage Act and the Computer Fraud and
    Abuse Act.

    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.
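    The rate-limiting the article mentions is typically something like a
    per-client token bucket. A minimal sketch (class and parameter names
    are mine, not anything Google has published):

    ```python
    import time

    class TokenBucket:
        """Minimal per-client rate limiter: allow `rate` requests per
        second, with bursts of up to `capacity` requests."""
        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    bucket = TokenBucket(rate=1.0, capacity=5.0)
    results = [bucket.allow() for _ in range(10)]  # 10 back-to-back calls
    print(results.count(True))  # only the burst capacity gets through
    ```

    This is exactly why the article hedges with "helps": a patient
    attacker just spreads the same 100,000 queries over more time or more
    API keys.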

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Richard Kettlewell@3:633/10 to All on Tuesday, April 28, 2026 08:41:22
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.

    Google, OpenAI and friends are rolling in money, no doubt they can look
    after themselves. But there?s a more general question there which is how
    to solve the problem of distributed attack in general including spam,
    port scanning, DDoS, etc etc etc, and in fact also including large-scale
    web scraping to feed LLMs.

    Anecdotally at least, a big part of the problem currently is
    "residential proxy networks", i.e. hijacking of domestic Internet
    connections, which allows attackers to use huge numbers of independent
    IP addresses. If you block 1000 addresses today, another 1000 will be
    along tomorrow.

    A solution to that would be a service to humanity.

    --
    https://www.greenend.org.uk/rjk/

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Nuno Silva@3:633/10 to All on Tuesday, April 28, 2026 09:45:12
    On 2026-04-28, Lawrence D'Oliveiro wrote:

    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems; now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack". Some AI service providers try to say this is
    against their terms and conditions, but how do you define what is and
    isn't allowed, exactly?

    First, it's of course allowed legally, on the same grounds they used to
    build it; in some countries it might even be illegal to impose
    restrictions on reusing information like that.

    Second, well, this is clearly a good test for the whole thing: either
    there is fairness and they're told "no can do", that they must handle
    the so-called "attack" themselves with no legal remedy, or they're told
    they also can't do DDoS attacks on the open web.

    Sadly, I think we know where the USA will land on this...


    <https://arstechnica.com/ai/2026/02/attackers-prompted-gemini-over-100000-times-while-trying-to-clone-it-google-says/>:

    In March 2023, shortly after Meta's LLaMA model weights leaked
    online, Stanford University researchers built a model called
    Alpaca by fine-tuning LLaMA on 52,000 outputs generated by
    OpenAI's GPT-3.5. The total cost was about $600. The result
    behaved so much like ChatGPT that it raised immediate questions
    about whether any AI model's capabilities could be protected once
    it was accessible through an API.

    It is all public domain. Maybe it'll take time for them to realize that
    that doesn't only mean they can get away with their derivative works...

    A big worry in the US now is that China is doing this to train its AI
    models <https://arstechnica.com/tech-policy/2026/04/us-accuses-china-of-industrial-scale-ai-theft-china-says-its-slander/>.
    And of course, as soon as it is seen as a "national security" issue,
    they start to lobby the politicians to pass laws against it. Imagine
    that: making too much use of the service, or prompting it according to
    certain forbidden patterns, could now become criminal violations of
    laws such as the Economic Espionage Act and the Computer Fraud and
    Abuse Act.

    But of course, this is yet another way in which to destroy access to
    general computing. It wasn't enough to ruin web search engines or
    websites in general; they do need to make it a *crime* to actually try
    to use a computer or the internet for anything other than a small list
    of proprietary commercial services.

    What, you, YOU THERE, are you using IRC and netnews? CATCH HIM!

    And Helen Lovejoy will lend them a hand, for sure.


    Another quote from the February article:

    As long as an LLM is accessible to the public, no foolproof
    technical barrier prevents a determined actor from doing the same
    thing to someone else's model over time (though rate-limiting
    helps), which is exactly what Google says happened to Gemini.

    Again, I'm not sure Google is understanding the ramifications of it
    being public domain. Which, given how they (don't) handle copyfraud in
    Google Books, isn't really surprising to me.

    --
    Nuno Silva

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Computer Nerd Kev@3:633/10 to All on Wednesday, April 29, 2026 08:26:16
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems, now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack".

    Great! Let the LLMs scrape each other instead of each one scraping
    my websites every minute. That'd be a more sane way for it to work,
    except all the investors hoping to buy into a future AI monopoly
    would run for the hills.

    --
    __ __
    #_ < |\| |< _#

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Kerr-Mudd, John@3:633/10 to All on Saturday, May 02, 2026 13:06:13
    On 29 Apr 2026 08:26:16 +1000
    not@telling.you.invalid (Computer Nerd Kev) wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    It's not enough that the creators of the AI models have scraped
    information from millions of websites to build their systems, now they
    have to worry about competitors short-circuiting the process by
    extracting information wholesale from their models, by making hundreds
    or thousands of requests and combining the results. This is called a
    "distillation attack".

    Great! Let the LLMs scrape each other instead of each one scraping
    my websites every minute. That'd be a more sane way for it to work,
    except all the investors hoping to buy into a future AI monopoly
    would run for the hills.

    No chance; they'll scrape any and everything to the bone.

    --
    Bah, and indeed, Humbug

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)