> Use of an AI-assisted classification step (human-in-the-loop),
Please don't start with that stuff here.
Hello,
I would like to share a conceptual idea for discussion, not a concrete implementation proposal.
One of the current challenges for large and long-lived projects like Debian is the accumulation of historical logs, archives, and public records that
may contain personal data (IPs, emails, names), especially for oldstable
and EOL releases.
My idea is a layered approach to data minimization:
1. Strict retention periods for raw logs (for example 30–90 days).
2. Automatic sanitization and anonymization of historical public records.
3. Use of an AI-assisted classification step (human-in-the-loop), where:
   - Clear personal data is anonymized automatically.
   - Ambiguous cases are isolated for human review.
4. Preservation of technical knowledge via summarized, signed incident
   records, instead of keeping large volumes of raw personal data.
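To make steps 2 and 3 a little more concrete, here is a minimal sketch of what a first-pass sanitizer with a human-review queue could look like. Everything here is illustrative: the regexes, the redaction markers, and the "ambiguous" heuristic are hypothetical placeholders, not a proposal for actual patterns.

```python
import re

# Illustrative patterns only; a real sanitizer would need far more
# careful rules (IPv6, hostnames, names, message-IDs, etc.).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w.-]+\.\w{2,}\b")
# Hypothetical heuristic: anything that merely looks like it could
# identify a person is routed to a human reviewer instead.
AMBIGUOUS_RE = re.compile(r"\buser(?:name)?[=:]\s*\S+", re.IGNORECASE)

def sanitize_line(line):
    """Return (sanitized_line, needs_human_review)."""
    needs_review = bool(AMBIGUOUS_RE.search(line))
    line = IPV4_RE.sub("[ip-redacted]", line)
    line = EMAIL_RE.sub("[email-redacted]", line)
    return line, needs_review

def sanitize_log(lines):
    """Split a log into auto-sanitized output and a review queue."""
    clean, review = [], []
    for line in lines:
        sanitized, flagged = sanitize_line(line)
        (review if flagged else clean).append(sanitized)
    return clean, review
```

The point of the split return value is the human-in-the-loop part: clear matches are rewritten automatically, while anything the heuristic is unsure about is set aside rather than silently altered or published.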
The goal would be to reduce GDPR exposure while keeping technical value, without rewriting history or removing useful information.
I am not proposing to implement this myself, only offering an idea that
could be discussed or explored in the future.
Thank you for your time.
Best regards,
pipo
On Wed, Jan 07, 2026 at 01:33:55AM -0300, pedro vezzosi wrote:
Hello,
I would like to share a conceptual idea for discussion, not a concrete implementation proposal.
One of the current challenges for large and long-lived projects like Debian
is the accumulation of historical logs, archives, and public records that may contain personal data (IPs, emails, names), especially for oldstable and EOL releases.
My idea is a layered approach to data minimization:
1. Strict retention periods for raw logs (for example 30–90 days).
2. Automatic sanitization and anonymization of historical public records.
3. Use of an AI-assisted classification step (human-in-the-loop), where:

I would rather make that: "protect personal data from artificial intelligence",
so the opposite of AI-assisted classification of personal data. Frankly, we
should start erasing personal data before we no longer can.

   - Clear personal data is anonymized automatically.
   - Ambiguous cases are isolated for human review.
4. Preservation of technical knowledge via summarized, signed incident
   records, instead of keeping large volumes of raw personal data.
The goal would be to reduce GDPR exposure while keeping technical value, without rewriting history or removing useful information.
I am not proposing to implement this myself, only offering an idea that could be discussed or explored in the future.
Thank you for your time.
Best regards,
pipo
--
Thank you for your reply and for sharing your perspective.
I would like to clarify one point, because I may not have expressed myself clearly.
My concern is not about having AI "read" or analyze personal data as such.
I fully understand that this can itself create additional GDPR and ethical risks. The point I was trying to raise comes more from an organizational angle.
Given that there are currently no dedicated people in a GDPR-focused role,
my worry is that privacy-related work may end up being purely reactive,
with someone having to act as a "firefighter" on top of their main responsibilities. I was thinking about whether there could be more
proactive approaches to data minimization, so that fewer problematic
records exist in the first place.
I also noticed that there is a debian-ai mailing list, and since I am new
to Debian mailing lists, it is possible that this was not the most appropriate list to bring up this idea. If so, I apologize for the noise
and appreciate the guidance.