• Idea: Reducing GDPR risk via automated log and data minimization

    From pedro vezzosi@3:633/10 to All on Wednesday, January 07, 2026 05:40:01
    Hello,
    I would like to share a conceptual idea for discussion, not a concrete implementation proposal.
    One of the current challenges for large and long-lived projects like Debian
    is the accumulation of historical logs, archives, and public records that
    may contain personal data (IPs, emails, names), especially for oldstable
    and EOL releases.
    My idea is a layered approach to data minimization:
    1. Strict retention periods for raw logs (for example, 30-90 days).
    2. Automatic sanitization and anonymization of historical public records.
    3. Use of an AI-assisted classification step (human-in-the-loop), where:
       - Clear personal data is anonymized automatically.
       - Ambiguous cases are isolated for human review.
    4. Preservation of technical knowledge via summarized, signed incident
       records, instead of keeping large volumes of raw personal data.
    The goal would be to reduce GDPR exposure while keeping technical value, without rewriting history or removing useful information.
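
    Points 1 and 2 of the list above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an existing Debian tool: the 90-day window, the regular expressions, the placeholder tokens, and the function names `redact` and `minimize` are all hypothetical, and the patterns only catch simple IPv4 addresses and email addresses (names are not handled).

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: upper bound of the "30-90 days" example from the list.
RETENTION = timedelta(days=90)

# Assumption: deliberately simple patterns; real logs would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(line: str) -> str:
    """Replace email addresses and IPv4 addresses with fixed placeholders."""
    line = EMAIL_RE.sub("<email>", line)
    line = IPV4_RE.sub("<ip>", line)
    return line


def minimize(records, now, retention=RETENTION):
    """Drop records older than the retention window and redact the rest.

    `records` is an iterable of (timestamp, text) pairs with timezone-aware
    timestamps; the result is a list of (timestamp, redacted_text) pairs.
    """
    return [
        (ts, redact(text))
        for ts, text in records
        if now - ts <= retention
    ]


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        (datetime(2026, 1, 1, tzinfo=utc), "login from 192.0.2.7 by alice@example.org"),
        (datetime(2025, 6, 1, tzinfo=utc), "upload by bob@example.org"),
    ]
    for ts, text in minimize(sample, now=datetime(2026, 1, 7, tzinfo=utc)):
        print(ts.date(), text)
```

    The sketch uses fixed placeholders rather than hashes, since a hash of an address can still act as a stable identifier for the same person across records.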
    I am not proposing to implement this myself, only offering an idea that
    could be discussed or explored in the future.
    Thank you for your time.
    Best regards,
    pipo


    --- PyGate Linux v1.5.2
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Antoine Le Gonidec@3:633/10 to All on Wednesday, January 07, 2026 17:50:01
    On Wed, Jan 07, 2026 at 01:33:55AM -0300, pedro vezzosi wrote:
    > Use of an AI-assisted classification step (human-in-the-loop), (...)
    Please don't start with that stuff here...


  • From Bart Martens@3:633/10 to All on Wednesday, January 07, 2026 18:30:01
    On Wed, Jan 07, 2026 at 01:33:55AM -0300, pedro vezzosi wrote:
    > Hello,
    >
    > I would like to share a conceptual idea for discussion, not a concrete
    > implementation proposal.
    >
    > One of the current challenges for large and long-lived projects like Debian
    > is the accumulation of historical logs, archives, and public records that
    > may contain personal data (IPs, emails, names), especially for oldstable
    > and EOL releases.
    >
    > My idea is a layered approach to data minimization:
    >
    > 1. Strict retention periods for raw logs (for example, 30-90 days).
    > 2. Automatic sanitization and anonymization of historical public records.
    > 3. Use of an AI-assisted classification step (human-in-the-loop), where:

    I would rather make that: "protect personal data from artificial intelligence", so the opposite of AI-assisted classification of personal data. Frankly, we should start erasing personal data before we no longer can.

    >    - Clear personal data is anonymized automatically.
    >    - Ambiguous cases are isolated for human review.
    > 4. Preservation of technical knowledge via summarized, signed incident
    >    records, instead of keeping large volumes of raw personal data.
    >
    > The goal would be to reduce GDPR exposure while keeping technical value,
    > without rewriting history or removing useful information.
    >
    > I am not proposing to implement this myself, only offering an idea that
    > could be discussed or explored in the future.
    >
    > Thank you for your time.
    >
    > Best regards,
    > pipo

  • From pedro vezzosi@3:633/10 to All on Wednesday, January 07, 2026 19:20:01
    Thank you for your reply and for sharing your perspective.
    I would like to clarify one point, because I may not have expressed myself clearly.
    My concern is not about having AI "read" or analyze personal data as such.
    I fully understand that this can itself create additional GDPR and ethical
    risks. The point I was trying to raise comes more from an organizational
    angle.
    Given that there are currently no dedicated people in a GDPR-focused role,
    my worry is that privacy-related work may end up being purely reactive,
    with someone having to act as a "firefighter" on top of their main
    responsibilities. I was thinking about whether there could be more
    proactive approaches to data minimization, so that fewer problematic
    records exist in the first place.
    I am not claiming that my idea is the right solution, nor that Debian
    should use AI for this. I only wanted to express a concern about privacy,
    which I consider a very important value in Debian, and to share a possible angle for discussion.
    I also noticed that there is a debian-ai mailing list, and since I am new
    to Debian mailing lists, it is possible that this was not the most
    appropriate list to bring up this idea. If so, I apologize for the noise
    and appreciate the guidance.
    Thank you for taking the time to reply.
    Best regards,
    pipo

  • From Gunnar Wolf@3:633/10 to All on Wednesday, January 07, 2026 22:00:01
    pedro vezzosi wrote [Wed, Jan 07, 2026 at 02:59:51PM -0300]:
    > Thank you for your reply and for sharing your perspective.
    >
    > I would like to clarify one point, because I may not have expressed myself
    > clearly.
    >
    > My concern is not about having AI "read" or analyze personal data as such.
    > I fully understand that this can itself create additional GDPR and ethical
    > risks. The point I was trying to raise comes more from an organizational
    > angle.
    >
    > Given that there are currently no dedicated people in a GDPR-focused role,
    > my worry is that privacy-related work may end up being purely reactive,
    > with someone having to act as a "firefighter" on top of their main
    > responsibilities. I was thinking about whether there could be more
    > proactive approaches to data minimization, so that fewer problematic
    > records exist in the first place.

    I saw that several media outlets picked up Andreas' call to re-form the Data
    Protection team. Don't take this as an issue that will take too long to
    resolve: several DDs have already answered his call, and I am confident
    a Data Protection Team will soon exist again.

    In the meantime... Well, most Debian Developers I know are extremely well
    aware and diligent in this regard compared with the population at large. I am
    confident we have a strong set of people to take care of these issues.

    > I also noticed that there is a debian-ai mailing list, and since I am new
    > to Debian mailing lists, it is possible that this was not the most
    > appropriate list to bring up this idea. If so, I apologize for the noise
    > and appreciate the guidance.

    The debian-ai mailing list is about packaging AI-related software in a way
    amenable to our distribution. This list is the right one for discussing
    non-technical aspects of project decisions.

    But yes, I agree with Antoine and Bart: it is extremely unlikely our
    project would undertake large-scale analysis / classification / use of
    personal data as described in your original post, at least as we currently stand.


    - Gunnar.
