Forum: Jacob's Hideout BBS

Schrödinger's hash

From Charles Curley@3:633/10 to All on Friday, May 22, 2026 18:00:01

I have four four terabyte hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.

On one LVM partition I have a number of backup files, tarred,
bzipped, and sha256 and sha512 summed. I have a script which finds
checksum files and executes the appropriate program to test the
archives. It puts each program into the background, parallelizing any
number of checksum tests.
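
For readers following along, here is a minimal sketch of that kind of parallel verification. It is not the poster's script, which is not shown in the thread. The file naming convention (a ".sha256" or ".sha512" file next to each archive), the function names, and the worker count are all assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed convention: "backup.tar.bz2.sha256" holds the digest of
# "backup.tar.bz2" in the usual "<hex digest>  <file name>" layout.
ALGORITHMS = {".sha256": "sha256", ".sha512": "sha512"}


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so a large archive is never held in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(checksum_file):
    """Return (checksum file name, True if the archive matches it)."""
    checksum_file = Path(checksum_file)
    algorithm = ALGORITHMS[checksum_file.suffix]
    expected = checksum_file.read_text().split()[0].lower()
    archive = checksum_file.with_suffix("")  # drop ".sha256" / ".sha512"
    return checksum_file.name, file_digest(archive, algorithm) == expected


def verify_tree(root, max_workers=4):
    """Verify every checksum file under root, several at a time."""
    jobs = [p for p in Path(root).rglob("*") if p.suffix in ALGORITHMS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(verify, jobs))
```

Calling verify_tree() on a backup directory returns a mapping from checksum file name to pass/fail. Setting max_workers=1 gives the one-at-a-time behaviour, which makes it easy to compare a serial run against a parallel one.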

Starting about a week ago, the script finds an error in one or more
files out of several. Results are inconsistent: one pass may find an
error in a given file, and the next pass may not find any errors in it.
Running checksums manually, one at a time, does not turn up an error.
Running "tar tvf" finds no error in a suspect file. Running "bunzip2 -t"
also turns up no error. Only running the script turns up any errors.

I create two checksum files when I create the backups, for sha256 and
sha512. After this problem surfaced (about a week ago), I then made two
new checksum files of a suspect file. The two checksum file pairs
(e.g. both sha512sum files) show the same checksums. The script now
tests using both the old and new checksum files. Sometimes only one pair
of checksum files fails the suspect file.

In addition to all of that, I also get the occasional "bad message"
error. I have no idea what that means, but an fsck seems to deal with
it.

To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM partitions
on the RAID array. Only fsck turned up issues, and that has not stopped.

I also back some of this up to offsite USB drives. I ran the script on
one of those, using a different computer. No errors reported.

I have a hypothesis as to what is going on, but would like to hear from
you before I discuss it.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andrew Latham@3:633/10 to All on Friday, May 22, 2026 18:10:02

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(

This may not be your issue, but I remember how annoying it was to figure
out.

--
- Andrew "lathama" Latham -


From The Wanderer@3:633/10 to All on Friday, May 22, 2026 18:20:01

On 2026-05-22 at 11:53, Charles Curley wrote:

Starting about a week ago, the script finds an error in one or more
files out of several. Results are inconsistent: one pass may find an
error in a given file, and the next pass may not find any errors in it.
Running checksums manually, one at a time, does not turn up an error.
Running "tar tvf" finds no error in a suspect file. Running "bunzip2
-t" also turns up no error. Only running the script turns up any
errors.

I create two checksum files when I create the backups, for sha256
and sha512. After this problem surfaced (about a week ago), I then
made two new checksum files of a suspect file. The two checksum file
pairs (e.g. both sha512sum files) show the same checksums. The script
now tests using both the old and new checksum files. Sometimes only
one pair of checksum files fails the suspect file.

In addition to all of that, I also get the occasional "bad message"
error. I have no idea what that means, but an fsck seems to deal
with it.

Just for clarity: where (from what source), and when (at what point), is
it that you get this error?

To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM
partitions on the RAID array. Only fsck turned up issues, and that
has not stopped.

I have a hypothesis as to what is going on, but would like to hear
from you before I discuss it.

The very first thing that came to my mind out of that was RAM issues.
Disk issues were the second, but the tests you've run there seem as if
they'd probably have ruled that out.

If you run a script to generate the hash of a given file in a loop
(possibly with a don't-overload-the-system pause in between if you
prefer), does it always show the same hash, or does it sometimes show
a different one?
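
A minimal sketch of such a loop, for anyone who wants to try it. The function names, the choice of SHA-256, the number of passes, and the pause length are all assumptions made for illustration.

```python
import hashlib
import time
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Hash one file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_hashes(path, passes=20, pause_seconds=0.0):
    """Hash the same file several times and count each distinct digest."""
    seen = Counter()
    for _ in range(passes):
        seen[sha256_of(path)] += 1
        time.sleep(pause_seconds)  # optional don't-overload-the-system pause
    return seen
```

A healthy system should produce exactly one key in the returned counter. One caveat: repeated reads of a file that fits in memory are likely to be served from the page cache, so later passes may exercise RAM and CPU more than the disks.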
--
The Wanderer
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw


From Charles Curley@3:633/10 to All on Friday, May 22, 2026 18:20:01

On Fri, 22 May 2026 10:05:56 -0600
Andrew Latham <lathama@gmail.com> wrote:

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(

This is not impossible. I recently had some RAM go bad, failing
memtest. I have replaced it with new RAM, which does not fail
memtest. Maybe I should let it run for several passes.

Thanks.


From Andrew Latham@3:633/10 to All on Friday, May 22, 2026 18:30:01

Yes, I should have added that this RAM was only failing when warm/hot
which was not fun to discover.

--
- Andrew "lathama" Latham -


From Charles Curley@3:633/10 to All on Friday, May 22, 2026 19:00:01

On Fri, 22 May 2026 10:21:28 -0600
Andrew Latham <lathama@gmail.com> wrote:

Yes, I should have added that this RAM was only failing when warm/hot
which was not fun to discover.

Hmmm, I wonder if memtest would stress the RAM enough to get it hot.
Interesting.

I do have a handheld infrared thermometer, which I mostly use for
cooking. But it would be perfect for the occasional RAM stress test.


From Charles Curley@3:633/10 to All on Friday, May 22, 2026 19:10:01

On Fri, 22 May 2026 12:19:08 -0400
The Wanderer <wanderer@fastmail.fm> wrote:

The very first thing that came to my mind out of that was RAM issues.
Disk issues was the second, but the tests you've run there seem as if
they'd probably have ruled that out.

I agree that the tests I've run so far would tend to rule out disk
issues. I just started another set of extended self tests, so we'll see
where that goes. The longest should take about ten hours.

If you run a script to generate the hash of a given file in a loop
(possibly with a don't-overload-the-system pause in between if you
prefer), does it always show the same hash, or does it sometimes show
a different one?

I haven't done such a script. But a freshly generated hash of a suspect
file will agree with a hash created when the suspect file was created,
which may have been five or seven years ago. At least so far.


From nwe@3:633/10 to All on Friday, May 22, 2026 19:40:01

On 5/22/26 11:05 AM, Andrew Latham wrote:

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS.

May I ask, was this ECC RAM?

My personal NAS contains 16 sticks of 16GB DDR3 registered RAM. It has
been logging a CE memory scrubbing error once or twice a day for 700+
days.

It is always the same page/address, triggering a soft offline of that
memory page.

At some point I mean to take the time to figure out which RAM stick is
the culprit. I am aware of one unrecoverable RAM error, which I
discovered one morning in the BIOS logs while investigating why this
server unexpectedly collapsed and restarted one night about 2 years ago.

If I knew there was harm in postponing the repair I might prioritize it.


From David Christensen@3:633/10 to All on Friday, May 22, 2026 20:10:01

On 5/22/26 09:05, Andrew Latham wrote:

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(

On 5/22/26 09:14, Charles Curley wrote:

This is not impossible. I recently had some RAM go bad, failing
memtest. I have replaced it with new RAM, which does not fail
memtest. Maybe I should let it run for several passes.

When I suspect a memory problem, I run Memtest86+ for 24+ hours:

https://memtest.org/

Linux ISO (64 bits) -> mt86plus_8.10_x86_64.iso.zip

The current version sets "CPU Sequencing Mode" to "Parallel (PAR)" by
default.

I use and suggest ECC memory.

I use and suggest ZFS with redundant disks for storage.

David


From Van Snyder@3:633/10 to All on Friday, May 22, 2026 20:40:01

On Fri, 2026-05-22 at 09:53 -0600, Charles Curley wrote:

I have four four terabyte hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.

I remarked to a local computer repair shop that I have a four TB backup
drive. He said "replace it. Four TB isn't ready yet."


From nwe@3:633/10 to All on Friday, May 22, 2026 21:50:01

On 5/22/26 1:32 PM, Van Snyder wrote:

I remarked to a local computer repair shop that I have a four TB
backup drive. He said "replace it. Four TB isn't ready yet."

How so? I thought 4TB was showing its age...

I'm running 12x 4TB drives. Used SAS drives. Accumulated power on time
ranges from 40,166 to 73,439 hours.

Smartctl informs me device /dev/sdf is worsening with increased read
errors over time. That one shows 73408 hours powered up, 72360.67 GB
read, 119545.193 GB written, 195 power cycles (13 since July 13 2024).
Defect list increased from 3 to 6872.
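
To put those sdf figures in perspective, here is a small back-of-the-envelope calculation. It only rearranges the numbers quoted above; the helper names are made up, and a year is taken as 365.25 days.

```python
HOURS_PER_YEAR = 24 * 365.25


def power_on_years(hours):
    """Convert accumulated power-on hours to years."""
    return hours / HOURS_PER_YEAR


def average_rate(total_gb, hours):
    """Average gigabytes transferred per power-on hour."""
    return total_gb / hours


sdf_hours = 73408  # power-on hours reported for /dev/sdf
print(f"{power_on_years(sdf_hours):.1f} years powered on")
print(f"{average_rate(72360.67, sdf_hours):.2f} GB read per hour")
print(f"{average_rate(119545.193, sdf_hours):.2f} GB written per hour")
```

That works out to somewhat more than eight years of continuous spinning, so a growing defect list at that age is not a surprise.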

I see two other drives have defect lists of 23 and 14, respectively.
All others are at 0. Considering that, I should probably prioritize
replacing at least sdf soon to avoid losing redundancy during resilver,
considering the age of the pool.

nwe@srv01:~$ zpool status -c vendor,model,size
pool: POOL1
state: ONLINE
scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10 04:34:06 2026
config:

NAME          STATE   READ WRITE CKSUM  vendor   model         size
POOL1         ONLINE     0     0     0
  raidz3-0    ONLINE     0     0     0
    sdb       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdc       ONLINE     0     0     0  TOSHIBA  MG04SCA40EN   3.6T
    sdd       ONLINE     0     0     0  TOSHIBA  MG04SCA40EN   3.6T
    sde       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdf       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdh       ONLINE     0     0     0  HP       MB4000FCWDK   3.6T
    sdg       ONLINE     0     0     0  HP       MB4000FCWDK   3.6T
    sdi       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdj       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdk       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
    sdl       ONLINE     0     0     0  HP       MB4000FCWDK   3.6T
    sdm       ONLINE     0     0     0  HP       MB4000FCWDK   3.6T

errors: No known data errors


From Charles Curley@3:633/10 to All on Friday, May 22, 2026 22:10:01

On Fri, 22 May 2026 12:34:01 -0500
nwe <nwe@gitcoding.net> wrote:

May I ask, was this ECC RAM?

Mine is not.

Handle 0x0044, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0040
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 8 GB
Form Factor: DIMM
Set: None
Locator: DIMM_B2
Bank Locator: BANK 3
Type: DDR3
Type Detail: Synchronous
Speed: 1600 MT/s
Manufacturer: 0420
Serial Number: 00000000
Asset Tag: 9876543210
Part Number: F3-1600C9-8GAB
Rank: 2
Configured Memory Speed: 1600 MT/s
Minimum Voltage: 1.5 V
Maximum Voltage: 1.5 V
Configured Voltage: 1.5 V

My personal NAS contains 16 sticks of 16GB DDR3 registered RAM. It is
logging a CE memory scrubbing error once or twice a day since 700+
days.

Where would one find such evidence? I imagine something like:

journalctl | grep -i RAM

but nothing that it produces jumps out at me.
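
One reason a case-insensitive search for "RAM" is noisy is that it also matches unrelated words such as "initramfs" and "program". A narrower filter might look for the strings the Linux kernel typically uses when it reports memory errors. The sketch below assumes those strings are "EDAC", "mce:", "Machine Check", and "Hardware Error". The function name is hypothetical, and the sample lines are made up for illustration rather than taken from anyone's logs.

```python
import re

# Assumed markers of kernel memory-error reports. Adjust to taste.
MEMORY_ERROR_PATTERN = re.compile(
    r"EDAC|mce:|machine check|hardware error", re.IGNORECASE
)


def memory_error_lines(log_lines):
    """Keep only the lines that look like memory or machine-check reports."""
    return [line for line in log_lines if MEMORY_ERROR_PATTERN.search(line)]


# Made-up sample lines, not real log output.
sample = [
    "kernel: Unpacking initramfs...",
    "kernel: EDAC MC0: 1 CE memory scrubbing error on DIMM_A1",
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "systemd[1]: Started a program.",
]
for line in memory_error_lines(sample):
    print(line)
```

The input could be the lines of kernel messages from the journal (for example the output of "journalctl -k"). On a machine without ECC an empty result proves little, since the hardware may have no way of noticing the error in the first place.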

It is always the same page/address, triggering a soft offline of that
memory page.


From nwe@3:633/10 to All on Friday, May 22, 2026 22:30:05

On 5/22/26 3:02 PM, Charles Curley wrote:

journalctl | grep -i RAM

Sure enough, that gets me a boatload of RAM error reports on my server.
On my desktop without ECC it does not. I think no noise = good. However,
I have rasdaemon installed on the server, and I think it may take a
combination of that + ECC to make the RAM errors log. It's been a while
since I set this up. I think I had to change a setting in the Dell BIOS
to prevent its log from eating the error instead of handing it to the OS.

I was simply reading sudo dmesg.

If I'm correct, memtest86 is nearly useless on ECC RAM.


From Charles Curley@3:633/10 to All on Friday, May 22, 2026 23:10:02

On Fri, 22 May 2026 15:22:05 -0500
nwe <nwe@gitcoding.net> wrote:

On 5/22/26 3:02 PM, Charles Curley wrote:

journalctl | grep -i RAM

Sure enough, that gets me a boatload of RAM error reports on my
server.

But not one on my desktop. I also went back several boots, and still no
errors.

On my desktop without ECC it does not.

Possibly because without ECC one has no way to detect the errors. I
don't know enough about modern RAM to be sure, so that's a guess.

I think no noise = good,

Not always. Maybe if you have ECC.

however, I have rasdaemon installed on the server, I think it may
take a combination of that + ECC to make the RAM errors log.

That is consistent with the description provided by "apt show
rasdaemon".

It's been a while since I set this up. I think I had
to change a setting in the Dell bios to prevent its log from eating
the error instead of handing it to the os.

I was simply reading sudo dmesg.

If I'm correct, memtest86 is nearly useless on ECC RAM.

Maybe.

"MemTest86 directly polls ECC errors logged in the chipset/memory
controller registers and displays it to the user on-screen. In
addition, ECC errors are written to the log and report file.

"During testing, MemTest86 may report ECC errors detected by the memory
controller if ECC is supported and enabled. This is demonstrated in the
following screenshot:"

https://www.memtest86.com/ecc.htm#memtest86


From nwe@3:633/10 to All on Friday, May 22, 2026 23:30:01

On 5/22/26 4:02 PM, Charles Curley wrote:

If I'm correct, memtest86 is nearly useless on ECC RAM.

Maybe.

"MemTest86 directly polls ECC errors logged in the chipset/memory
controller registers and displays it to the user on-screen. In
addition, ECC errors are written to the log and report file.

"During testing, MemTest86 may report ECC errors detected by the memory
controller if ECC is supported and enabled. This is demonstrated in the
following screenshot:"

https://www.memtest86.com/ecc.htm#memtest86

That memtest86 info is more useful than I remembered.

It's been too long since I studied the subject, but more is coming back.
The BIOS setting in my Dell R720XD rack server had something to do with
a choice between having the hardware handle ECC versus allowing the OS
to control it. The default setting was hardware, at which point the
underlying ECC corrections/faults seemed hidden/inaccessible from the OS
side of the memory controller and only appeared in the BIOS logs.

I just now tried to find it, but suspect I will have to take a look at
it next time I'm in the server's BIOS settings, which is not often. It
is a matter of connecting a monitor and keyboard directly to it and
rebooting, and it takes a couple of minutes to POST.


From David Christensen@3:633/10 to All on Saturday, May 23, 2026 01:30:01

On 5/22/26 12:43, nwe wrote:

On 5/22/26 1:32 PM, Van Snyder wrote:

I remarked to a local computer repair shop that I have a four TB
backup drive. He said "replace it. Four TB isn't ready yet."

How so? I thought 4TB was showing its age...

+1 I am also curious why 4 TB HDD's are not "ready yet".

I'm running 12x 4TB drives. Used SAS drives. Accumulated power on time
ranges from 40,166 to 73,439 hours.

Smartctl informs me device /dev/sdf is worsening with increased read
errors over time. That one shows 73408 hours powered up, 72360.67 GB
read, 119545.193 GB written, 195 power cycles (13 since July 13 2024).
Defect list increased from 3 to 6872.

I see two other drives have defect lists of 23 and 14, respectively.
All others are at 0. Considering that, I should probably prioritize
replacing at least sdf soon to avoid losing redundancy during resilver,
considering the age of the pool.

I am still trying to understand the smartctl(8) "SMART Attributes Data
Structure". The RAW_VALUE seems to be a binary bit field (?) for
several attributes and is useless without manufacturer engineering data.
The VALUE column is supposed to be a percentage that starts at 100%
and goes down to 0% as the disk wears out:

* Raw_Read_Error_Rate, Seek_Error_Rate, and Hardware_ECC_Recovered can
have low VALUE numbers, but the disk seems to keep working.

* Low VALUE numbers for Reallocated_Sector_Ct, Current_Pending_Sector,
and/or Offline_Uncorrectable seem to be reliable indicators of a failing
disk.

* I have not seen a VALUE number other than 100% for End-to-End_Error.
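
As a rough illustration of the observations above, the sketch below parses the ATA attribute table that "smartctl -A" prints and picks out the three attributes named as reliable indicators. The sample table is made up for illustration. The rule used here (flag a watched attribute when its normalized VALUE has reached THRESH, or when its raw count is non-zero) is an assumption, not a vendor-defined test. SAS drives report a different layout, so this applies to ATA/SATA output only.

```python
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_attributes(text):
    """Parse rows of the form: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            rows[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": " ".join(fields[9:]),
            }
    return rows


def worrying(rows):
    """Names of watched attributes that look bad under the assumed rule."""
    flagged = []
    for name, row in rows.items():
        if name not in WATCHED:
            continue
        if row["value"] <= row["thresh"] or row["raw"].split()[0] != "0":
            flagged.append(name)
    return sorted(flagged)


# Made-up sample in the usual column layout, not output from a real disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   075   063   044    Pre-fail  Always       -       36500321
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
print(worrying(parse_attributes(SAMPLE)))
```

Raw_Read_Error_Rate is deliberately left out of the watched set, in line with the remark that it can show low values on a disk that keeps working.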

Twelve disks give you many choices for how to lay out the pool and
trade off redundancy vs. capacity vs. performance. Is the data balanced
across disks? Does the machine have enough memory? Is the ARC working
well?

On two of my earlier pools, I added a 60 GB SSD as a cache vdev after
the pools had data. I did not notice any improvement.

On one of my earlier pools of one mirror of two 3 TB HDD's that was
nearly full, I added another mirror of two 3 TB HDD's. I did not notice
any improvement.

I rebuilt the storage pool with two mirrors of two 3 TB HDD's each and
a special vdev mirror of two 180 GB SSD's. I also set
special_small_blocks=16K. I then restored the data via replication.
The data is now balanced across disks, latency has dropped, throughput
has increased, and overall performance is noticeably better:

2026-05-22 15:12:45 toor@f5 ~
# freebsd-version
13.5-RELEASE-p12

2026-05-22 15:19:47 toor@f5 ~
# zpool iostat -v p5
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
p5                3.76T  1.82T      6      1  3.68M  32.2K
  mirror-0        1.87T   871G      2      0  1.82M  4.48K
    gpt/hdd0.eli      -      -      1      0   931K  2.24K
    gpt/hdd1.eli      -      -      1      0   931K  2.24K
  mirror-1        1.86T   876G      2      0  1.81M  4.35K
    gpt/hdd2.eli      -      -      1      0   928K  2.18K
    gpt/hdd3.eli      -      -      1      0   928K  2.18K
special               -      -      -      -      -      -
  mirror-2        31.1G   118G      1      1  51.2K  23.3K
    gpt/ssd0.eli      -      -      0      0  25.6K  11.7K
    gpt/ssd1.eli      -      -      0      0  25.6K  11.7K
----------------  -----  -----  -----  -----  -----  -----

2026-05-22 15:32:42 toor@f5 ~
# top -d 1 | head -n 7
last pid: 57622;  load averages:  0.24,  0.21,  0.17  up 24+22:47:05  15:32:45
27 processes:  1 running, 26 sleeping
CPU:  0.0% user,  0.0% nice,  0.6% system,  0.0% interrupt, 99.4% idle
Mem: 4848K Active, 330M Inact, 856K Laundry, 14G Wired, 920M Buf, 694M Free
ARC: 12G Total, 10G MFU, 485M MRU, 3328K Anon, 200M Header, 899M Other
     9921M Compressed, 33G Uncompressed, 3.36:1 Ratio
Swap: 764M Total, 764M Free

2026-05-22 15:33:12 toor@f5 ~
# arc_summary | grep -A 5 "ARC total accesses"
ARC total accesses (hits + misses): 512.7M
Cache hit ratio: 99.8 % 511.8M
Cache miss ratio: 0.2 % 886.5k
Actual hit ratio (MFU + MRU hits): 99.3 % 509.2M
Data demand efficiency: 99.5 % 4.8M
Data prefetch efficiency: 19.2 % 96.9k
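
The percentages in that arc_summary output can be cross-checked from the counts printed beside them. A small sketch, assuming the k/M/G suffixes are decimal multipliers; the helper names are made up for illustration.

```python
SUFFIXES = {"k": 1e3, "M": 1e6, "G": 1e9}


def parse_count(text):
    """Turn a figure such as '512.7M' or '886.5k' into a plain number."""
    text = text.strip()
    if text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def ratio_percent(part, whole):
    """Share of 'whole' made up by 'part', as a percentage."""
    return 100.0 * parse_count(part) / parse_count(whole)


total = "512.7M"  # ARC total accesses (hits + misses)
print(f"hit ratio:        {ratio_percent('511.8M', total):.1f} %")
print(f"miss ratio:       {ratio_percent('886.5k', total):.1f} %")
print(f"actual hit ratio: {ratio_percent('509.2M', total):.1f} %")
```

The recomputed values agree with the printed ones to one decimal place, which is as much as the rounded counts allow.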

In hindsight:

1. I gathered file system statistics prior to rebuilding the pool and
setting special_small_blocks=16K, but it now appears I could have used
a larger value.

2. If I get worried about HDD's failing, I can add disks to the pool as
spares and/or add disks to the data mirrors. The latter should increase
read performance even more.

3. My ~10 year old HDD's can already saturate Gigabit with sequential
I/O. RAID 10 with SSD acceleration is even more overkill. I want 10 GbE.
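
On the first point, one way to gather that kind of statistic is to walk the tree and see how much data sits in files at or below each candidate threshold. The sketch below is an assumption about what such a survey could look like, not the method actually used. It treats a whole file at or below the threshold as one small block, which is only an approximation of how ZFS places blocks. The function name and the candidate sizes are made up for illustration.

```python
import os


def small_file_share(root, thresholds=(16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024)):
    """For each threshold, count files of that size or smaller and their bytes."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                sizes.append(os.path.getsize(path))
    total_bytes = sum(sizes)
    report = {}
    for limit in thresholds:
        small = [s for s in sizes if s <= limit]
        report[limit] = {
            "files": len(small),
            "bytes": sum(small),
            "share_of_bytes": sum(small) / total_bytes if total_bytes else 0.0,
        }
    return report
```

Comparing the "bytes" figure for each threshold with the free space on the special vdev gives a rough idea of how large a value would still fit.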

David


From Van Snyder@3:633/10 to All on Saturday, May 23, 2026 01:40:01

On Fri, 2026-05-22 at 16:25 -0700, David Christensen wrote:

On 5/22/26 12:43, nwe wrote:

On 5/22/26 1:32 PM, Van Snyder wrote:

I remarked to a local computer repair shop that I have a four TB
backup drive. He said "replace it. Four TB isn't ready yet."

How so? I thought 4TB was showing its age...

+1 I am also curious why 4 TB HDD's are not "ready yet".

The repair guy who made that remark to me didn't explain why he
believed it. Maybe he's just not keeping up with what's happening. My
WDC WD40EDAZ-11CFPB0 reports 68 power-on resets, zero read errors, and
zero seek errors, but with only 2,791 power-on hours.

I'm running 12x 4TB drives. Used SAS drives. Accumulated power on
time
ranges from 40,166 to 73,439 hours.

Smartctl informs me device /dev/sdf is worsening with increased
read
errors over time. That one shows 73408 hours powered up, 72360.67
GB
read, 119545.193 GB written, 195 power cycles (13 since July 13
2024).
Defect list increased from 3 to 6872.

I see two other drives have defect lists of 23 and 14,
respectively. All
others are at 0. Considering that, I should probably prioritize
replacing at least sdf soon to avoid losing redundancy during
resilver,
considering the age of the pool.

I am still trying to understand the smartctl(8) "SMART Attributes
Data
Structure".� The RAW_VALUE seems to be a binary bit field (?) for
several attributes and is useless without manufacturer engineering
data.
� The VALUE column is supposed to be a percentage that starts at 100%
and goes down to 0% as the disk wears out:

* Raw_Read_Error_Rate, Seek_Error_Rate, and Hardware_ECC_Recovered
can
have low VALUE numbers, but the disk seems to keep working.

* Low VALUE numbers for Reallocated_Sector_Ct,
Current_Pending_Sector,
and/or Offline_Uncorrectable seem to be reliable indicators of a
failing
disk.

* I have not seen a VALUE number other than 100% for End-to-
End_Error.

nwe@srv01:~$ zpool status -c vendor,model,size
pool: POOL1
state: ONLINE
scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10
04:34:06
2026
config:

NAME� � � � STATE� � �READ WRITE CKSUM� �vendor� � � � �model� size
POOL1� � � �ONLINE� � � �0� � �0� � �0
raidz3-0� ONLINE� � � �0� � �0� � �0
sdb� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdc� � �ONLINE� � � �0� � �0� � �0� TOSHIBA� �MG04SCA40EN� 3.6T
sdd� � �ONLINE� � � �0� � �0� � �0� TOSHIBA� �MG04SCA40EN� 3.6T
sde� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdf� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdh� � �ONLINE� � � �0� � �0� � �0� � � �HP� �MB4000FCWDK� 3.6T
sdg� � �ONLINE� � � �0� � �0� � �0� � � �HP� �MB4000FCWDK� 3.6T
sdi� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdj� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdk� � �ONLINE� � � �0� � �0� � �0� SEAGATE� ST4000NM0023� 3.6T
sdl� � �ONLINE� � � �0� � �0� � �0� � � �HP� �MB4000FCWDK� 3.6T
sdm� � �ONLINE� � � �0� � �0� � �0� � � �HP� �MB4000FCWDK� 3.6T

errors: No known data errors

Twelve disks give you many choices for how to lay out the pool and
trade off redundancy vs. capacity vs. performance.  Is the data balanced
across disks?  Does the machine have enough memory?  Is the ARC working
well?

On two of my earlier pools, I added a 60 GB SSD as a cache vdev after
the pools had data.  I did not notice any improvement.

On one of my earlier pools of one mirror of two 3 TB HDD's that was
nearly full, I added another mirror of two 3 TB HDD's.  I did not notice
any improvement.

I rebuilt the storage pool with two mirrors of two 3 TB HDD's each and a
special vdev mirror of two 180 GB SSD's.  I also set
special_small_blocks=16K.  I then restored the data via replication.
The data is now balanced across disks, latency has dropped, throughput
has increased, and overall performance is noticeably better:

2026-05-22 15:12:45 toor@f5 ~
# freebsd-version
13.5-RELEASE-p12

2026-05-22 15:19:47 toor@f5 ~
# zpool iostat -v p5
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
p5                3.76T  1.82T      6      1  3.68M  32.2K
  mirror-0        1.87T   871G      2      0  1.82M  4.48K
    gpt/hdd0.eli      -      -      1      0   931K  2.24K
    gpt/hdd1.eli      -      -      1      0   931K  2.24K
  mirror-1        1.86T   876G      2      0  1.81M  4.35K
    gpt/hdd2.eli      -      -      1      0   928K  2.18K
    gpt/hdd3.eli      -      -      1      0   928K  2.18K
special               -      -      -      -      -      -
  mirror-2        31.1G   118G      1      1  51.2K  23.3K
    gpt/ssd0.eli      -      -      0      0  25.6K  11.7K
    gpt/ssd1.eli      -      -      0      0  25.6K  11.7K
----------------  -----  -----  -----  -----  -----  -----

2026-05-22 15:32:42 toor@f5 ~
# top -d 1 | head -n 7
last pid: 57622;  load averages:  0.24,  0.21,  0.17  up 24+22:47:05  15:32:45
27 processes:  1 running, 26 sleeping
CPU:  0.0% user,  0.0% nice,  0.6% system,  0.0% interrupt, 99.4% idle
Mem: 4848K Active, 330M Inact, 856K Laundry, 14G Wired, 920M Buf, 694M Free
ARC: 12G Total, 10G MFU, 485M MRU, 3328K Anon, 200M Header, 899M Other
     9921M Compressed, 33G Uncompressed, 3.36:1 Ratio
Swap: 764M Total, 764M Free

2026-05-22 15:33:12 toor@f5 ~
# arc_summary | grep -A 5 "ARC total accesses"
ARC total accesses (hits + misses):                          512.7M
    Cache hit ratio:                             99.8 %      511.8M
    Cache miss ratio:                             0.2 %      886.5k
    Actual hit ratio (MFU + MRU hits):           99.3 %      509.2M
    Data demand efficiency:                      99.5 %        4.8M
    Data prefetch efficiency:                    19.2 %       96.9k
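
As a sanity check on how the percentages relate to the counts, here is a small Python sketch that recomputes the hit and miss ratios from the three totals shown above. The parse_count helper is hypothetical, and reading the k/M suffixes as decimal multipliers is an assumption.

```python
# Assumption: arc_summary count suffixes are decimal (k = 1e3, M = 1e6).
UNITS = {"k": 1e3, "M": 1e6, "G": 1e9}

def parse_count(text):
    """Convert an arc_summary style count such as '511.8M' to a float."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

# Counts copied from the arc_summary output above.
total = parse_count("512.7M")
hits = parse_count("511.8M")
misses = parse_count("886.5k")

hit_ratio = 100 * hits / total
miss_ratio = 100 * misses / total
print(f"hit {hit_ratio:.1f} %, miss {miss_ratio:.1f} %")
```

The recomputed ratios round to the same 99.8 % and 0.2 % that arc_summary reports, and hits plus misses add up to the total within rounding.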

In hindsight:

1.  I gathered file system statistics prior to rebuilding the pool and
setting special_small_blocks=16K, but it now appears I could have used a
larger value.

2.  If I get worried about HDD's failing, I can add disks to the pool as
spares and/or add disks to the data mirrors.  The latter should increase
read performance even more.

3.  My ~10 year old HDD's can already saturate Gigabit with sequential
I/O.  RAID 10 with SSD acceleration is even more overkill.  I want 10
GbE.

David

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Saturday, May 23, 2026 02:00:01

On 5/22/26 13:22, nwe wrote:

On 5/22/26 3:02 PM, Charles Curley wrote:

journalctl | grep -i RAM

Sure enough, that gets me a boatload of RAM error reports on my server.
On my desktop without ECC it does not. I think no noise = good. However,
I have rasdaemon installed on the server, and I think it may take a
combination of that + ECC to make the RAM errors log. It's been a while
since I set this up. I think I had to change a setting in the Dell BIOS
to prevent its log from eating the error instead of handing it to the OS.

I was simply reading sudo dmesg.

If I'm correct, memtest86 is nearly useless on ECC RAM.

MemTest86 v11.7 Free Edition claims to support ECC:

https://www.memtest86.com/compare.html

AFAICT memtest86+ does not support ECC. Some people suggest disabling
ECC in BIOS/UEFI Setup and then testing with memtest86+.

David

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From nwe@3:633/10 to All on Saturday, May 23, 2026 02:20:01

On 5/22/26 6:25 PM, David Christensen wrote:

I am still trying to understand the smartctl(8) "SMART Attributes Data
Structure".  The RAW_VALUE seems to be a binary bit field (?)

same here

Some makes/models of hardware seem to produce a greater quantity of
comprehensible SMART data.

Running, for example,
# smartctl -x /dev/sdf
returns additional data.

Twelve disks give you many choices for how to lay out the pool and
trade off redundancy vs. capacity vs. performance.  Is the data
balanced across disks?  Does the machine have enough memory?  Is the
ARC working well?

It has 256GB RAM.

My desktop PC currently has only 1Gb networking ever since I replaced my
fiber optic card with a GPU in the lone PCIe slot. During the time I had
10g networking direct from server to workstation, I recall easily
saturating the network. Amazing speed, but I needed the GPU more. 10g
networking is faster than the cheap SSDs in most of my PCs.

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Saturday, May 23, 2026 07:50:01

On 5/22/26 17:16, nwe wrote:

On 5/22/26 6:25 PM, David Christensen wrote:

I am still trying to understand the smartctl(8) "SMART Attributes Data
Structure".  The RAW_VALUE seems to be a binary bit field (?)

same here

Some makes/models of hardware seem to produce a greater quantity of
comprehensible SMART data.

Running, for example,
# smartctl -x /dev/sdf
returns additional data.

Twelve disks give you many choices for how to lay out the pool and
trade off redundancy vs. capacity vs. performance.  Is the data
balanced across disks?  Does the machine have enough memory?  Is the
ARC working well?

It has 256GB RAM.

My desktop PC currently has only 1Gb networking ever since I replaced my
fiber optic card with a GPU in the lone PCIe slot. During the time I had
10g networking direct from server to workstation, I recall easily
saturating the network. Amazing speed, but I needed the GPU more. 10g
networking is faster than the cheap SSDs in most of my PCs.

12 HDD raidz3 already exceeds 10 Gbps for sequential I/O.
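
The back-of-the-envelope arithmetic behind that claim, as a Python sketch. The 150 MB/s per-disk streaming rate is an assumed round number, not a measurement from this pool, and the model ignores everything except the count of data disks.

```python
def raidz_sequential_gbps(disks, parity, mb_per_s_per_disk):
    """Rough streaming estimate: only the data disks contribute payload."""
    data_disks = disks - parity
    megabytes_per_s = data_disks * mb_per_s_per_disk
    return megabytes_per_s * 8 / 1000  # MB/s -> Gbps (decimal units)

# 12-wide raidz3 leaves 9 data disks; 150 MB/s per HDD is an assumption.
estimate = raidz_sequential_gbps(disks=12, parity=3, mb_per_s_per_disk=150)
print(f"{estimate:.1f} Gbps")
```

With those assumptions the nine data disks deliver about 1350 MB/s, a little over what a 10 Gbps link can carry.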

256 GB of memory could be enough to fit your entire workload within the
ARC, so random I/O may have also saturated 10 Gbps. If your processor
memory bus has enough channels, your NIC has enough PCIe lanes, and your
workload does not require synchronous writes (or you tune ZFS to fake
it), a suitable workstation or backup server could saturate 25, 50, or
100 Gbps single/dual/quad Ethernet. SFPx switches are expensive, so I
have considered dual SFPx cards in my workstation, primary server, and
backup server, connected in a ring.

If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
and 10 GbE (copper RJ-45) Ethernet adapters. If you have a Thunderbolt
3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber single and dual
Ethernet adapters. Be sure to verify Debian and Linux driver support
with the manufacturer before purchasing:

https://www.amazon.com/s?k=thunderbolt+sfp+ethernet

https://www.sonnettech.com/product/twin25g/overview.html

I have been using Intel SSD 520 Series 2.5" SATA III drives for many
years. They are an enterprise grade product that was put in various
desktops, laptops, and netbooks ~14 years ago. Resellers harvest and
resell them on eBay. The 60 GB model works well for Linux and FreeBSD
system drives. The 180 GB model has the best performance
specifications, and works well for Windows system drives and ZFS
accelerators (at my 6 TB scale). Prices for used smaller drives can
still be reasonable in spite of the AI bubble.

All that said, when your workload goes outside the ARC or the HDD
caches, HDD latency will become the bottleneck. SATA/SAS 6 or 12 Gbps
SSD accelerators chosen to match the workload could help. NVMe PCIe
would be even better. Optane would be best:

https://goughlui.com/2024/07/28/tech-flashback-intel-optane-3d-xpoint-memory/

How do you back up a 36 TB pool?

David

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From nwe@3:633/10 to All on Saturday, May 23, 2026 17:00:01

On 5/23/26 12:44 AM, David Christensen wrote:

If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
and 10 GbE (copper RJ-45) Ethernet adapters.  If you have a
Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
single and dual Ethernet adapters.  Be sure to verify Debian and Linux
driver support with the manufacturer before purchasing:

https://www.amazon.com/s?k=thunderbolt+sfp+ethernet

https://www.sonnettech.com/product/twin25g/overview.html

I've looked at those, been thinking of trying it some time. The only
time I really wish for faster than 1g networking straight to my
workstation is when I'm cloning a complete hard drive to a backup image
file in the pool.

I've been buying cheap 10G dual SFP PCIe cards on ebay. I configure the
two ports as a bridge then just run fiber from one machine to the next.
The server is in the middle of the string, so if I shut that down, my
whole network is as good as down. Nearly all my network services depend
on the server anyway.

The only glitch I've run into so far is that I've got to match the
correct optics to the cards, speaking of vendor lock-in. Intel cards
want Intel optics. Most other brands seem to accept most other brands'
optics. No experience yet with Cisco. Dell cards can come as Intel or
other.

I also bought cheap managed network switches from AliExpress with two
10g SFP+ ports plus four 1g RJ45 ports. I guess I got what I paid for:
they seem to mostly work, but I had a few random failures along the way.
I've read scare stories about these potentially dialing home etc. I have
not confirmed such. I suspect these can't, as I have them configured on
an isolated VLAN.

One wonderful network switch deal I find common on eBay is the
OS6450-P48. It is cheap, with 48 PoE 1g ports plus two 10g SFP ports
that don't seem real picky about what brand of optics I use. It is a bit
technical to configure.

I have been using Intel SSD 520 Series 2.5" SATA III drives for many
years.

Same here, just discovered them cheap on eBay a few years ago. Some show
up with SMART reports indicating wear that would trigger me replacing a
consumer grade SSD. I've been using those nearly daily over 2 years with
no failures yet.

How do you back up a 36 TB pool?

(blush)

I know my current backup scheme could someday bite:

1. I have a full rsync backup on a spare R710 server in another
building. That server probably hasn't been powered up in 18 months. All
I need to do is go plug it into an outlet then spend a couple minutes at
a ssh session from the comfort of my office chair.

2. The most critical files get regularly backed up (manually) to a
remote storage vps.
I compress the files into an encrypted archive then upload using scp.

3. Some random memory sticks and hard drives...

So I've got some coverage in case of disaster but it could use improvement.
I know automatic backups could be nice but I just don't trust such.

I also have a clone of the server's boot drive.

nwe

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Sunday, May 24, 2026 04:30:01

On 5/23/26 07:52, nwe wrote:

On 5/23/26 12:44 AM, David Christensen wrote:

If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
and 10 GbE (copper RJ-45) Ethernet adapters.  If you have a
Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
single and dual Ethernet adapters.

I've looked at those, been thinking of trying it some time. The only
time I really wish for faster than 1g networking straight to my
workstation is when I'm cloning a complete hard drive to a backup image
file in the pool.

+1 for images, especially when they are encrypted and incompressible. I
image my system drives monthly and keep the newest three. The Linux and
FreeBSD system images are deliberately small enough to fit onto "16 GB"
devices (SD card, USB, SSD, or HDD). This is plenty for the servers and
maintenance/rescue live drives, but I sometimes want more for the
workstation. The worst are Windows machines with one drive and
everything on it.

Another use-case for enterprise-speed Ethernet is ZFS replication.

The only glitch I've run into so far is that I've got to match the
correct optics to the cards, speaking of vendor lock-in. Intel cards
want Intel optics. Most other brands seem to accept most other brands'
optics. No experience yet with Cisco. Dell cards can come as Intel or
other.

Thank you for the warning.

I also bought cheap managed network switches from AliExpress with two
10g SFP+ ports plus four 1g RJ45 ports. I guess I got what I paid for:
they seem to mostly work, but I had a few random failures along the way.
I've read scare stories about these potentially dialing home etc. I have
not confirmed such. I suspect these can't, as I have them configured on
an isolated VLAN.

One wonderful network switch deal I find common on eBay is the
OS6450-P48. It is cheap, with 48 PoE 1g ports plus two 10g SFP ports
that don't seem real picky about what brand of optics I use. It is a bit
technical to configure.

I switched to Ubiquiti Networks UniFi products several years ago when I
got tired of logging in to multiple different web control panels and
trying to keep all the network settings in sync. "One web control panel
to rule them all" is a killer feature that is worth paying for. So, I
look at their 2.5/5/10 Gbps switches periodically, but have not made a
purchase (yet).

I have been using Intel SSD 520 Series 2.5" SATA III drives for many
years.

Same here, just discovered them cheap on eBay a few years ago. Some show
up with SMART reports indicating wear that would trigger me replacing a
consumer grade SSD. I've been using those nearly daily over 2 years with
no failures yet.

I bought my first Intel SSD 520 Series 60 GB 2.5" SATA III at Best Buy
on Black Friday the year they came out. I now have eight 60 GB drives
plus eight 180 GB drives. Many were bought used. I agree that some of
their SMART statistics can look scary, but the lowest
Media_Wearout_Indicator VALUE for any of them is 95. Using 180 GB
drives as a ZFS special vdev with special_small_blocks and 24x7 in a
lightly used SOHO server is slowly eating those drives.

How do you back up a 36 TB pool?

(blush)

I know my current backup scheme could someday bite:

1. I have a full rsync backup on a spare R710 server in another
building. That server probably hasn't been powered up in 18 months. All
I need to do is go plug it into an outlet then spend a couple minutes at
a ssh session from the comfort of my office chair.

2. The most critical files get regularly backed up (manually) to a
remote storage vps.
I compress the files into an encrypted archive then upload using scp.

3. Some random memory sticks and hard drives...

So I've got some coverage in case of disaster but it could use improvement.
I know automatic backups could be nice but I just don't trust such.

I also have a clone of the server's boot drive.

nwe

I came to the conclusion that I needed a matching server for backups.
So, I built one and replicate weekly. It sounds like you could dust off
the R710. An enterprise-grade Ethernet connection between the primary
and backup servers would encourage regular backups.
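
To put rough numbers on that, here is a Python sketch of how long a full copy of 36 TB would take at different link speeds. The 0.9 efficiency factor is an assumed allowance for protocol overhead, and the calculation only applies to an initial full copy; incremental replication normally moves far less data.

```python
def transfer_hours(terabytes, link_gbps, efficiency=0.9):
    """Hours to move `terabytes` (decimal TB) over a link at the given efficiency."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# Assumed scenario: one full 36 TB copy over 1, 10, and 25 Gbps links.
for gbps in (1, 10, 25):
    print(f"{gbps:>2} Gbps: {transfer_hours(36, gbps):6.1f} h")
```

Under these assumptions a full copy takes several days on Gigabit and well under half a day on 10 Gbps, provided the disks on both ends can keep up.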

I also replicate to a near-site external disk weekly, and rotate that
with an off-site disk bi-monthly. Seagate says they make a 36 TB HDD,
but I cannot find any for sale or in stock on the WWW. An external RAID
enclosure could reach that capacity with smaller drives, but external
drive enclosures have failed me too many times over the years.

David

--- PyGate Linux v1.5.14
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Anssi Saari@3:633/10 to Unknown on Monday, May 25, 2026 11:50:02

nwe <nwe@gitcoding.net> writes:

If I'm correct, memtest86 is nearly useless on ECC RAM.

There's a commercial variant that has an option to enable error
injection mode but on the two computers I've tried it, the manufacturer
has conveniently disabled that.

The only remaining way for consumers to see ECC in action is apparently
to undervolt the RAM to force errors. Assuming such a setting is
available. And seems like lots of work for minimal gain.

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andy Smith@3:633/10 to Unknown on Monday, May 25, 2026 13:20:01

Hi,

On Fri, May 22, 2026 at 04:58:49PM -0700, David Christensen wrote:

On 5/22/26 13:22, nwe wrote:

On 5/22/26 3:02 PM, Charles Curley wrote:
If I'm correct, memtest86 is nearly useless on ECC RAM.

MemTest86 v11.7 Free Edition claims to support ECC:

https://www.memtest86.com/compare.html

AFAICT memtest86+ does not support ECC. Some people suggest disabling ECC
in BIOS/UEFI Setup and then testing with memtest86+.

The only machines I have ECC RAM in also have Machine Check Exception
messages go to a log in the firmware. I have experienced running
memtest86 for a few successful complete passes and then finding messages
in the firmware log about things that were corrected, which enabled me
to locate the bad stick.

Also, most of the time I've had RAM fail it's done so in a way that ECC
can't fix, because ECC can only correct a single bit flip.

So, I have continued to find memtest86 useful although it would be a bit
less so if I had no way to see the MCE log.
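
For anyone curious what "correct a single bit flip" looks like, here is a toy Python sketch of a SECDED (single error correct, double error detect) code: a Hamming(7,4) code plus one overall parity bit. It is only an illustration of the principle; real ECC memory typically protects 64-bit words with 8 check bits, and the function names here are made up.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (index 0 is overall parity)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    # Parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        status = "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble

word = encode(0b1011)
word[5] ^= 1          # one flipped bit is repaired
print(decode(word))
word[6] ^= 1          # a second flipped bit can only be detected
print(decode(word))
```

A stick that flips two or more bits in the same word is therefore reported but not fixed, which matches the failures described above.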

Thanks,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Monday, May 25, 2026 23:20:01

On 5/25/26 04:16, Andy Smith wrote:

Hi,

Hello. :-)

The only machines I have ECC RAM in also have Machine Check Exception
messages go to a log in the firmware. I have experienced running
memtest86 for a few successful complete passes and then finding messages
in the firmware log about things that were corrected, which enabled me
to locate the bad stick.

Also, most of the time I've had RAM fail it's done so in a way that ECC
can't fix, because ECC can only correct a single bit flip.

So, I have continued to find memtest86 useful although it would be a bit
less so if I had no way to see the MCE log.

It sounds like I need to learn about Debian's "collectd-core" module and mcelog(8) (?).

David

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Monday, June 01, 2026 23:00:02

On Fri, 22 May 2026 10:05:56 -0600
Andrew Latham <lathama@gmail.com> wrote:

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(

This may not be your issue, but I remember how annoying it was to
figure out.

Thanks. I tried testing for this. I recently had one of two RAM sticks
go bad. I bought replacements in April and installed them. To test for
one of the new sticks being bad, I put the one good stick in instead of
the two new ones. The problem shows up. So I'm not sure whether I have
bad RAM or not.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andrew Latham@3:633/10 to All on Monday, June 01, 2026 23:10:01

Charles

There are many things that might be wrong, just reading your OP I
thought about how it matched up with my bad ram situation.

Beyond chasing hardware, is there any situation where the checksums or
tar are being run against files that are being written to? Should not
hurt to ask

Also, are there any time stamps you can match up to system logs and/or
dmesg?
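
One way to get such time stamps is to hash a suspect file several times in a row, one pass at a time, and record when each pass ran. A minimal Python sketch, with made-up helper names, using SHA-256 because that is one of the checksums already in use:

```python
import hashlib
import time

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def repeat_hash(path, passes=5):
    """Hash the same file several times; return a list of (timestamp, hexdigest)."""
    results = []
    for _ in range(passes):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        results.append((stamp, sha256_of(path)))
    return results

def is_consistent(results):
    """True when every pass produced the same digest."""
    return len({digest for _, digest in results}) == 1
```

Calling repeat_hash() on an archive and printing the pairs gives time stamps to compare against journalctl or dmesg. Note that repeated reads may be served from the page cache rather than the disks, so a mismatch between passes would point more toward memory or the CPU path than toward the drives; that interpretation is an assumption, not something tested here.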

On Mon, Jun 1, 2026 at 2:57 PM Charles Curley
<charlescurley@charlescurley.com> wrote:

On Fri, 22 May 2026 10:05:56 -0600
Andrew Latham <lathama@gmail.com> wrote:

I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(

This may not be your issue, but I remember how annoying it was to
figure out.

Thanks. I tried testing for this. I recently had one of two RAM sticks
go bad. I bought replacements in April and installed them. To test for
one of the new sticks being bad, I put the one good stick in instead of
the two new ones. The problem shows up. So I'm not sure whether I have
bad RAM or not.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--
- Andrew "lathama" Latham -

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Monday, June 01, 2026 23:20:01

On Fri, 22 May 2026 09:53:17 -0600
Charles Curley <charlescurley@charlescurley.com> wrote:

To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM
partitions on the RAID array. Only fsck turned up issues, and that
has not stopped.

Some additional testing.

Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:

"Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
" When the command that caused the error occurred, the device was active or idle.",
"",
" After command completion occurred, registers were:",
" ER -- ST COUNT LBA_48 LH LM LL DV DC",
" -- -- -- == -- == == == -- -- -- -- --",
" 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
"",
" Commands leading to the command that caused the error were:",
" CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
" -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
" 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
" b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
" b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
"",
"SMART Extended Self-test Log Version: 1 (1 sectors)",
"Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
"# 1 Extended offline Completed without error 00% 6756 -",
"# 2 Extended offline Completed without error 00% 6573 -",
"# 3 Extended offline Completed without error 00% 102 -",
"# 4 Short offline Completed without error 00% 96 -",
"",

So I did the obvious: I failed and removed the drive from the array. The
problem still showed up, but with fewer failures in the same data set.

I have since added the drive back to the array, and am testing the
array now.

mdadm --monitor --test --oneshot /dev/md0
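
A note on that command: as far as I can tell from mdadm(8), --monitor with --test and --oneshot sends a test alert for each array and reports degraded arrays, but does not re-read the data. The kernel's md driver exposes a consistency check through sysfs instead. The Python sketch below reads and triggers it; the function names are made up, and the directory is passed in as a parameter so the code can be tried against a scratch directory before pointing it at /sys/block/md0/md as root.

```python
from pathlib import Path

def md_status(md_dir):
    """Return (sync_action, mismatch_cnt) read from an md sysfs directory."""
    md = Path(md_dir)
    action = (md / "sync_action").read_text().strip()
    mismatches = int((md / "mismatch_cnt").read_text().strip())
    return action, mismatches

def request_check(md_dir):
    """Ask the kernel to start a read-only consistency check (needs root)."""
    (Path(md_dir) / "sync_action").write_text("check\n")
```

Typical use would be request_check("/sys/block/md0/md"), waiting until md_status() reports "idle" again, and then looking at the mismatch count. A non-zero count after a check would be worth correlating with the checksum failures.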

I begin to wonder if I have a bad motherboard.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andrew Latham@3:633/10 to All on Monday, June 01, 2026 23:30:01

I hate to suggest other tangents but re-seat all the connectors and
maybe a power supply test. A brown-out of power would cause issues
like this.

On Mon, Jun 1, 2026 at 3:15 PM Charles Curley
<charlescurley@charlescurley.com> wrote:

On Fri, 22 May 2026 09:53:17 -0600
Charles Curley <charlescurley@charlescurley.com> wrote:

To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM
partitions on the RAID array. Only fsck turned up issues, and that
has not stopped.

Some additional testing.

Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:

"Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
" When the command that caused the error occurred, the device was active or idle.",
"",
" After command completion occurred, registers were:",
" ER -- ST COUNT LBA_48 LH LM LL DV DC",
" -- -- -- == -- == == == -- -- -- -- --",
" 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
"",
" Commands leading to the command that caused the error were:",
" CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
" -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
" 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
" b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
" b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
"",
"SMART Extended Self-test Log Version: 1 (1 sectors)",
"Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
"# 1 Extended offline Completed without error 00% 6756 -",
"# 2 Extended offline Completed without error 00% 6573 -",
"# 3 Extended offline Completed without error 00% 102 -",
"# 4 Short offline Completed without error 00% 96 -",
"",

So I did the obvious: I failed and removed the drive from the array. The
problem still showed up, but with fewer failures in the same data set.

I have since added the drive back to the array, and am testing the
array now.

mdadm --monitor --test --oneshot /dev/md0

I begin to wonder if I have a bad motherboard.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--
- Andrew "lathama" Latham -

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Tuesday, June 02, 2026 00:00:01

On Mon, 1 Jun 2026 15:23:02 -0600
Andrew Latham <lathama@gmail.com> wrote:

I hate to suggest other tangents but re-seat all the connectors and
maybe a power supply test. A brown-out of power would cause issues
like this.

I did reseat all the data and power cables for the hard drives, the SSD
system drive and the CD, etc. drive. I doubt I've had a brown-out, as
the computer in question is on a UPS which has plenty of spare battery.

As for the power supply test, not having a power supply tester, I have
not done that.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 00:50:01

On 6/1/26 14:15, Charles Curley wrote:

Some additional testing.

Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:

"Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
" When the command that caused the error occurred, the device was active or idle.",
"",
" After command completion occurred, registers were:",
" ER -- ST COUNT LBA_48 LH LM LL DV DC",
" -- -- -- == -- == == == -- -- -- -- --",
" 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
"",
" Commands leading to the command that caused the error were:",
" CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
" -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
" 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
" b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
" b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
"",
"SMART Extended Self-test Log Version: 1 (1 sectors)",
"Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
"# 1 Extended offline Completed without error 00% 6756 -",
"# 2 Extended offline Completed without error 00% 6573 -",
"# 3 Extended offline Completed without error 00% 102 -",
"# 4 Short offline Completed without error 00% 96 -",
"",

So I did the obvious: I failed and removed the drive from the array. The
problem still showed up, but with fewer failures in the same data set.

I have since added the drive back to the array, and am testing the
array now.

mdadm --monitor --test --oneshot /dev/md0

I begin to wonder if I have a bad motherboard.

Up until 2019, I was using Debian GNU/Linux on desktop hardware as a
file server. When I upgraded to a server motherboard and ECC memory, I
started seeing DMA errors. During trouble-shooting, I realized that I
had been collecting SATA parts since the days of SATA I 1.5 Gbps --
HBA's, cables, racks, and drawers. My file server had a mix of various
known and unknown parts, including red SATA cables (red dye can cause
copper conductors to oxidize into dust). So, I replaced all of the
unknown and obsolete parts with new parts clearly rated and marked for
SATA III 6 Gbps. The disk problems went away. When I wanted more
HDD's, I bought SAS 6 Gbps HBA's, cables, and HDD's.

Similarly, most of the memory problems I encountered were caused by
incompatibility between the motherboard and the memory module(s). I
suggest documenting your motherboard, documenting your memory modules,
and doing the homework. Memory manufacturers typically have a search
feature on their web site that will produce a list of compatible memory
modules given a computer or motherboard make and model. eBay sellers
often include the computer/motherboard make/model for pulled memory
modules. And, you can always STFW.

For a server, I prefer and recommend workstation/server motherboards,
ECC memory, ext4/UFS for the system disk, and ZFS RAID10 for data.

David

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andrew Latham@3:633/10 to All on Tuesday, June 02, 2026 01:20:01

I was less than clear, I meant a brown-out on the DC rail. Checksums
do cause a power spike.

On Mon, Jun 1, 2026 at 3:56 PM Charles Curley
<charlescurley@charlescurley.com> wrote:

On Mon, 1 Jun 2026 15:23:02 -0600
Andrew Latham <lathama@gmail.com> wrote:

I hate to suggest other tangents but re-seat all the connectors and
maybe a power supply test. A brown-out of power would cause issues
like this.

I did reseat all the data and power cables for the hard drives, the SSD
system drive and the CD, etc. drive. I doubt I've had a brown-out, as
the computer in question is on a UPS which has plenty of spare battery.

As for the power supply test, not having a power supply tester, I have
not done that.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--
- Andrew "lathama" Latham -

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 01:20:01

On 6/1/26 14:55, Charles Curley wrote:

I did reseat all the data and power cables for the hard drives, the SSD system drive and the CD, etc. drive.

Good.

I doubt I've had a brown-out, as
the computer in question is on a UPS which has plenty of spare battery.

Check the voltage at the UPS input with a good voltmeter. In the USA,
nominal voltage is 120 VAC. The commonly cited service tolerance (ANSI
C84.1) is +/- 5%, or 114 to 126 VAC.

A good UPS should be able to accept dirty out-of-spec input and produce
clean, tight output.

Do not plug laser printers, photo copiers, or any other heavy loads into
the computer UPS. Those machines draw significant current in surges and
can clip the output of UPS's, causing the voltage to sag, lights to
flicker, and electronics to malfunction.

As for the power supply test, not having a power supply tester, I have
not done that.

Buy one. It will save your sanity.

Related -- I bought good cases, power supplies, and fans when I built my
most recent servers:

https://www.fractal-design.com/products/cases/define/define-r5/black/

https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/

https://www.fractal-design.com/products/fans/dynamic/dynamic-gp-14/white/

David

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From nwe@3:633/10 to All on Tuesday, June 02, 2026 06:30:01

On 6/1/26 4:23 PM, Andrew Latham wrote:

I hate to suggest other tangents but re-seat all the connectors and
maybe a power supply test. A brown-out of power would cause issues
like this.

+1

I've seen too many hard-to-diagnose problems that disappeared after
replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
video noise, hang on reboot, etc. I no longer try to go cheap when
choosing a PSU for builds.

For example, I currently recommend Corsair 750 Watt as a minimum,
however, there are also other good quality brands. I concede there are
lots of PCs out there running PSUs lesser than Corsair 750 Watt just
fine. I do notice the PC on which I'm typing this is still running a
cheap knock-off 350 Watt PSU that has not yet caused me any problems.

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Stefan Monnier@3:633/10 to All on Tuesday, June 02, 2026 16:10:01

I've seen too many hard-to-diagnose problems that disappeared after
replacing a sketchy PSU. Random hangs and crashes, hard drive errors, video noise, hang on reboot, etc. I no longer try to go cheap when choosing a PSU for builds.

Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
while it's now standard for chips to monitor their temperature (and
adjust their power consumption if it gets too high), I still haven't
seen anything comparable that would detect and report when the input
voltage goes out-of-range (and maybe also take steps to reduce the instantaneous power consumption?).

Instead, we're in the dark, forced to try to avoid the problem
by over-provisioning the power supply and pray that it was enough.

=== Stefan

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 21:10:01

On 6/2/26 07:02, Stefan Monnier wrote:

I've seen too many hard-to-diagnose problems that disappeared after
replacing a sketchy PSU. Random hangs and crashes, hard drive errors, video
noise, hang on reboot, etc. I no longer try to go cheap when choosing a PSU
for builds.

Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
while it's now standard for chips to monitor their temperature (and
adjust their power consumption if it gets too high), I still haven't
seen anything comparable that would detect and report when the input
voltage goes out-of-range (and maybe also take steps to reduce the instantaneous power consumption?).

Instead, we're in the dark, forced to try to avoid the problem
by over-provisioning the power supply and pray that it was enough.

=== Stefan

https://en.wikipedia.org/wiki/Lm_sensors

lm_sensors (Linux-monitoring sensors) is a free open-source
software-tool for Linux that provides tools and drivers for monitoring temperatures, voltage, humidity, and fans. It can also detect chassis intrusions.

https://packages.debian.org/search?searchon=names&keywords=lm-sensors

David

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Dan Ritter@3:633/10 to All on Tuesday, June 02, 2026 21:30:02

David Christensen wrote:

On 6/2/26 07:02, Stefan Monnier wrote:

I've seen too many hard-to-diagnose problems that disappeared after replacing a sketchy PSU. Random hangs and crashes, hard drive errors, video
noise, hang on reboot, etc. I no longer try to go cheap when choosing a PSU
for builds.

Yeah, I'm actually surprised the hardware hasn't caught on accordingly: while it's now standard for chips to monitor their temperature (and
adjust their power consumption if it gets too high), I still haven't
seen anything comparable that would detect and report when the input voltage goes out-of-range (and maybe also take steps to reduce the instantaneous power consumption?).

https://en.wikipedia.org/wiki/Lm_sensors

lm_sensors (Linux-monitoring sensors) is a free open-source software-tool
for Linux that provides tools and drivers for monitoring temperatures, voltage, humidity, and fans. It can also detect chassis intrusions.

Almost all of those sensors are on the motherboard or on
attached cards; you'll frequently get to see voltages from the
CPU, but hardly ever does a power supply tell you about its
load.

If you plug in a UPS, most (but not all) can give you
information about the wall outlet power and how much is
currently being drawn by the machines attached to the UPS.

-dsr-

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Joe@3:633/10 to All on Tuesday, June 02, 2026 21:50:01

On Tue, 2 Jun 2026 15:06:20 -0400
Dan Ritter <dsr@randomstring.org> wrote:

David Christensen wrote:

On 6/2/26 07:02, Stefan Monnier wrote:

I've seen too many hard-to-diagnose problems that disappeared
after replacing a sketchy PSU. Random hangs and crashes, hard
drive errors, video noise, hang on reboot, etc. I no longer
try to go cheap when choosing a PSU for builds.

Yeah, I'm actually surprised the hardware hasn't caught on
accordingly: while it's now standard for chips to monitor their temperature (and adjust their power consumption if it gets too
high), I still haven't seen anything comparable that would detect
and report when the input voltage goes out-of-range (and maybe
also take steps to reduce the instantaneous power consumption?).

https://en.wikipedia.org/wiki/Lm_sensors

lm_sensors (Linux-monitoring sensors) is a free open-source
software-tool for Linux that provides tools and drivers for
monitoring temperatures, voltage, humidity, and fans. It can also
detect chassis intrusions.

Almost all of those sensors are on the motherboard or on
attached cards; you'll frequently get to see voltages from the
CPU, but hardly ever does a power supply tell you about its
load.

They are also going to be polled, and will return the voltage they find
at polling time. What they won't tell you is what various ripple
voltages are, both the initial rectified mains at 100/120Hz and the
residual switching frequencies of the various step-down regulators.

Under some combination of conditions, including temperature, the
combinations of ripple may allow a regulator output to drop out of spec
for a microsecond, more than enough to corrupt a signal on e.g. a SATA
line. Further, this may only happen once a week or so.

Increasing ripple voltage is what happens when smoothing capacitor
electrolytes dry out, which they will eventually do with age.

Einstein allegedly said:

"Insanity is doing the same thing over and over again and expecting
different results"

He had obviously never encountered the Intermittent Fault.

--
Joe

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Stefan Monnier@3:633/10 to All on Wednesday, June 03, 2026 16:40:01

They are also going to be polled, and will return the voltage they find
at polling time. What they won't tell you is what various ripple
voltages are, both the initial rectified mains at 100/120Hz and the
residual switching frequencies of the various step-down regulators.

Yeah, I think we'd need a kind of sensor that doesn't give just the
current voltage but gives a bracket of the lowest & highest voltage seen
since the last measurement, or one that can trigger an interrupt if the
voltage ever goes outside of a given range.
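
A software approximation of that bracket, as a minimal sketch: the hypothetical helper below (bracket is a made-up name, not part of lm_sensors) reads one voltage reading per line and reports the lowest and highest value seen. It only brackets the samples it is given, so a dip between two polls is still invisible to it.

```shell
# bracket: read one numeric reading per line on stdin and print the lowest
# and highest value seen across all samples. Blank lines are ignored.
# Hypothetical helper for illustration only.
bracket() {
    awk '
        NF == 0  { next }            # skip empty lines
                 { v = $1 + 0 }      # force numeric interpretation
        n++ == 0 { lo = hi = v }     # first sample initialises both bounds
        v < lo   { lo = v }
        v > hi   { hi = v }
        END      { if (n > 0) printf "min=%.3f max=%.3f\n", lo, hi }
    '
}
```

For example, printf '12.05\n11.87\n12.11\n' | bracket prints min=11.870 max=12.110. Where the readings come from is left out on purpose, since sensor labels differ from chip to chip.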

=== Stefan

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Stefan Monnier@3:633/10 to All on Wednesday, June 03, 2026 16:40:01

Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
while it's now standard for chips to monitor their temperature (and
adjust their power consumption if it gets too high), I still haven't
seen anything comparable that would detect and report when the input
voltage goes out-of-range (and maybe also take steps to reduce the
instantaneous power consumption?).
Instead, we're in the dark, forced to try to avoid the problem
by over-provisioning the power supply and pray that it was enough.

lm_sensors (Linux-monitoring sensors) is a free open-source software-tool
for Linux that provides tools and drivers for monitoring temperatures, voltage, humidity, and fans. It can also detect chassis intrusions.

Two problems:

- AFAIK there is no standard system to actively use those voltage
sensors to detect potentially harmful situations and take measures
(not even logging weird voltage events seems to be standard).
- My understanding is that the problems related to power supplies which
we'd like to catch are related to *very* transient variations on input
voltage, and AFAIK the infrastructure around those sensors just isn't
equipped to detect such variations.

=== Stefan

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Thursday, June 04, 2026 20:40:01

On 6/3/26 19:12, Eben King wrote:

On 6/3/26 10:36, Stefan Monnier wrote:

They are also going to be polled, and will return the voltage they find
at polling time. What they won't tell you is what various ripple
voltages are, both the initial rectified mains at 100/120Hz and the
residual switching frequencies of the various step-down regulators.

Yeah, I think we'd need a kind of sensor that doesn't give just the
current voltage but gives a bracket of the lowest & highest voltage seen
since the last measurement, or one that can trigger an interrupt if the
voltage ever goes outside of a given range.

I want something that does a Fourier transform of the voltage data so I
can see if there's ripple at any particular frequency. A live waterfall
plot would be wonderful.

https://www.tek.com/en/products/oscilloscopes/dpo70000sx

David

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Joe@3:633/10 to All on Friday, June 05, 2026 19:30:01

On Fri, 5 Jun 2026 11:54:00 -0400
duh <fill_in_the_blanks@email.com> wrote:

On 6/2/26 15:47, Joe wrote:

Einstein allegedly said:

"Insanity is doing the same thing over and over again and
expecting different results"

He had obviously never encountered the Intermittent Fault.

I may be a little slow upstairs, but how does this differ from
persistence??????? Just asking! :-)

He seems to be implying that the universe is completely deterministic,
whereas at the scale which humans can deal with, it isn't. He was
implying that persistence in trying to locate an intermittent fault was
akin to insanity.

Which was particularly silly because he was also alleged to have said
'God does not play dice', showing that he was aware of the existence of
a device invented for the specific purpose of obtaining different
results while doing the same (at a human level) thing.

--
Joe

--- PyGate Linux v1.5.15
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Sunday, June 21, 2026 23:10:01

On Fri, 22 May 2026 09:53:17 -0600
Charles Curley <charlescurley@charlescurley.com> wrote:

I have four four terabyte hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.

I believe I have found a solution to this problem. I installed the
backports kernel. Since then I have run more than four hours solid of
tests and not found a single error.

I did replace one hard drive. While that resulted in a quieter office,
it did not solve the problem.

Checking voltages from the power supply and the wall with a digital
volt meter did not show any out of spec problems.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Monday, June 22, 2026 09:40:01

On Fri, 22 May 2026 09:53:17 -0600 Charles Curley wrote:

I have four four terabyte hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.

On one LVM partition I have a number of backup files, tarred,
bzipped, and sha256 and sha512 summed. I have a script which will find
checksum files, and execute the appropriate program to test the
archives. It puts each program into the background, parallelizing any
number of checksum tests.

On 6/21/26 14:05, Charles Curley wrote:

I believe I have found a solution to this problem. I installed the
backports kernel. Since then I have run more than four hours solid of
tests and not found a single error.

I did replace one hard drive. While that resulted in a quieter office,
it did not solve the problem.

Checking voltages from the power supply and the wall with a digital
volt meter did not show any out of spec problems.

I am glad that your storage is working correctly now.

Please run and post the following commands with both the previous kernel
and the backports kernel:

$ cat /etc/debian_version

$ uname -a

I will assume your script spawns a separate, isolated process for each checksum file.

If you have ruled out the power supply, memory, and disks, another
possibility could be a race condition in the kernel and/or I/O stack
that is triggered when multiple processes access storage in parallel.

Are the checksum errors repeatable on another computer with a similar
storage architecture and the previous kernel? If so, do they disappear
with the backports kernel?

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Monday, June 22, 2026 19:00:01

On Mon, 22 Jun 2026 00:32:19 -0700
David Christensen <dpchrist@holgerdanske.com> wrote:

I am glad that your storage is working correctly now.

Thank you.

Please run and post the following commands with both the previous
kernel and the backports kernel:

$ cat /etc/debian_version

$ uname -a

Backport

root@hawk:~# cat /etc/debian_version
13.5
root@hawk:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.5
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@hawk:~# uname -s
Linux
root@hawk:~# uname -a
Linux hawk 7.0.10+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 7.0.10-1~bpo13+1 (2026-05-28) x86_64 GNU/Linux
root@hawk:~#

Prior kernel is the same except the kernel is

linux-image-6.12.90+deb13.1-amd64 6.12.90-2 amd64

I think that this problem first showed up late April or early May. If
that is correct, the following extracts from my apt logs might help.
These were copied and pasted unwrapped, so unless they were mangled in
transit, long lines should be preserved.

Start-Date: 2026-04-13 12:46:25
Commandline: apt install -t trixie-backports linux-image-amd64
Install: linux-image-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-modules-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-base-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-binary-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-base-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic)
Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.19.10-1~bpo13+1), linux-libc-dev:amd64 (6.12.74-2, 6.19.10-1~bpo13+1)
End-Date: 2026-04-13 12:46:52

Start-Date: 2026-04-16 08:45:37
Commandline: apt remove linux-image-amd64
Remove: linux-image-amd64:amd64 (6.19.10-1~bpo13+1)
End-Date: 2026-04-16 08:45:39

Start-Date: 2026-04-16 08:45:47
Commandline: apt install linux-image-amd64
Install: linux-image-amd64:amd64 (6.12.74-2)
End-Date: 2026-04-16 08:45:49

Start-Date: 2026-04-18 11:27:13
Commandline: apt install -t trixie-backports linux-image-amd64
Install: linux-image-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic), linux-modules-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic), linux-binary-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic)
Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.19.11-1~bpo13+1)
End-Date: 2026-04-18 11:27:44

Start-Date: 2026-04-21 10:35:49
Commandline: apt purge linux-image-amd64
Purge: linux-image-amd64:amd64 (6.19.11-1~bpo13+1)
End-Date: 2026-04-21 10:35:52

Start-Date: 2026-04-21 10:36:06
Commandline: apt install linux-image-amd64
Install: linux-image-amd64:amd64 (6.12.74-2)
End-Date: 2026-04-21 10:36:08

Start-Date: 2026-05-01 04:40:22
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-6.12.85+deb13-amd64:amd64 (6.12.85-1, automatic)
Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.12.85-1)
End-Date: 2026-05-01 04:40:45

Start-Date: 2026-05-09 04:49:58
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-6.12.86+deb13-amd64:amd64 (6.12.86-1, automatic)

Start-Date: 2026-05-16 04:01:47
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-6.12.88+deb13-amd64:amd64 (6.12.88-1, automatic)

Start-Date: 2026-05-24 11:54:07
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-6.12.90+deb13-amd64:amd64 (6.12.90-1, automatic)

Start-Date: 2026-05-29 04:08:32
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-6.12.90+deb13.1-amd64:amd64 (6.12.90-2, automatic)

Start-Date: 2026-06-20 11:19:40
Commandline: apt install -t trixie-backports linux-image-amd64

Apparently, twice in April I had tried the backports kernel(s), and
found them unsatisfactory. So possibly the fix came in between
linux-image-6.19.11+deb13-amd64 and the current backports kernel,
linux-image-7.0.10+deb13-amd64. Quite possibly a major version number
change might even inadvertently fix something as subtle as this.

I will assume your script spawns a separate, isolated process for
each checksum file.

Correct. It does a find on *.sha256sums, *.sha512sums, and several
other suffixes. It then slices and dices to figure out the appropriate
program to call, sha256sum and sha512sum, respectively.

The files are created using paths relative to the current directory, so
when the script runs, it will pushd to that directory.

The key line is

nice "${prog}" "${opts}" -c "${file}" &

Where $prog is the result of the slicing and dicing, opts='--quiet',
and $file the file to be scanned.

I recently added the option to limit the number of background tasks to
the number of processors (nproc --all). That reduces the number of
errors but does not eliminate them.
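
For readers following along, here is a minimal sketch of a script along those lines. It is not the original script: the function names verify_one and verify_tree are made up, only the two suffixes named above are handled, file names are assumed to contain no newlines, and jobs are started in batches (a simplification; a real job pool would start a new check as soon as one finishes).

```shell
# verify_one SUMFILE: pick the checksum program from the suffix, change to
# the directory holding SUMFILE (the sums use relative paths), run the check.
verify_one() {
    sumfile=$1
    case "$sumfile" in
        *.sha256sums) prog=sha256sum ;;
        *.sha512sums) prog=sha512sum ;;
        *) echo "unknown suffix: $sumfile" >&2; return 2 ;;
    esac
    dir=$(dirname -- "$sumfile")
    base=$(basename -- "$sumfile")
    # Subshell so the cd does not leak into the caller.
    ( cd -- "$dir" && nice "$prog" --quiet -c "$base" )
}

# verify_tree ROOT [MAX_JOBS]: check every checksum file under ROOT, running
# at most MAX_JOBS checks at a time. Returns non-zero if any check failed.
verify_tree() {
    root=$1
    max_jobs=${2:-$(nproc --all)}
    list=$(mktemp) || return 2
    find "$root" -type f \( -name '*.sha256sums' -o -name '*.sha512sums' \) > "$list"
    failures=0
    pids=
    running=0
    while IFS= read -r sumfile; do
        verify_one "$sumfile" &
        pids="$pids $!"
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            # Batch is full: wait for every job in it, counting failures.
            for pid in $pids; do
                wait "$pid" || failures=$((failures + 1))
            done
            pids=
            running=0
        fi
    done < "$list"
    # Collect the final, possibly partial, batch.
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    rm -f -- "$list"
    echo "checksum files that failed: $failures" >&2
    [ "$failures" -eq 0 ]
}
```

Calling verify_tree with a job limit of 1 gives the serial behaviour that, per the first message in this thread, never showed an error, which makes it easy to compare the serial and parallel cases with the same code.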

If you have ruled out the power supply, memory, and disks, another possibility could be a race condition in the kernel and/or I/O stack
that is triggered when multiple processes access storage in parallel.

I wouldn't say those are all ruled out. But the fact that the
backport kernel is a major version number change, and that it appears
to have solved the problem is highly suggestive. With that caveat, I
concur. That is definitely something for the kernel folks to look at.

Are the checksum errors repeatable on another computer with a similar storage architecture and the previous kernel? If so, do they
disappear with the backports kernel?

I have a much more recent and much faster computer, peregrine, with nvme storage and 12 cores. Hawk has eight cores, and spinning rust. Hawk is
where the problem has shown up. Peregrine does not show the problem.

For 18 gig of data, hawk: 2m18.318s, peregrine 0m24.233s.

David

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Tuesday, June 23, 2026 00:50:02

On 6/22/26 09:54, Charles Curley wrote:

On Mon, 22 Jun 2026 00:32:19 -0700 David Christensen wrote:

Please run and post the following commands with both the previous
kernel and the backports kernel:

$ cat /etc/debian_version

$ uname -a

Backport

root@hawk:~# cat /etc/debian_version
13.5
root@hawk:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.5
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@hawk:~# uname -s
Linux
root@hawk:~# uname -a
Linux hawk 7.0.10+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 7.0.10-1~bpo13+1 (2026-05-28) x86_64 GNU/Linux
root@hawk:~#

Prior kernel is the same except the kernel is

linux-image-6.12.90+deb13.1-amd64 6.12.90-2 amd64

Good information.

I think that this problem first showed up late April or early May. If
that is correct, the following extracts from my apt logs might help.
These were copied and pasted unwrapped, so unless they were mangled in transit, long lines should be preserved.

...

Apparently, twice in April I had tried the backports kernel(s), and
found them unsatisfactory. So possibly the fix came in between linux-image-6.19.11+deb13-amd64 and the current backports kernel, linux-image-7.0.10+deb13-amd64.

AIUI if someone can come up with a shell script or program whose exit
value reliably indicates the presence or absence of a bug, Git can do a
binary search over a range of commits and locate the commit where the
bug originated.

Quite possibly a major version number
might even inadvertently fix something as subtle as this.

Agreed.

I will assume your script spawns a separate, isolated process for
each checksum file.

Correct. It does a find on *.sha256sums, *.sha512sums, and several
other suffixes. It then slices and dices to figure out the appropriate
program to call, sha256sum and sha512sum, respectively.

The files are created using paths relative to the current directory, so
when the script runs, it will pushd to that directory.

The key line is

nice "${prog}" "${opts}" -c "${file}" &

Where $prog is the result of the slicing and dicing, opts='--quiet',
and $file the file to be scanned.

I use a similar workflow for image files and also wrote a script to
generate and verify checksum files.

I recently added the option to limit the number of background tasks to
the number of processors (nproc --all). That reduces the number of
errors but does not eliminate them.

That is another clue that there is a race condition related to parallel I/O.

If you have ruled out the power supply, memory, and disks, another
possibility could be a race condition in the kernel and/or I/O stack
that is triggered when multiple processes access storage in parallel.

I wouldn't say those are all ruled out. But the fact that the
backport kernel is a major version number change, and that it appears
to have solved the problem is highly suggestive. With that caveat, I
concur. That is definitely something for the kernel folks to look at.

Agreed.

Are the checksum errors repeatable on another computer with a similar
storage architecture and the previous kernel? If so, do they
disappear with the backports kernel?

I have a much more recent and much faster computer, peregrine, with nvme storage and 12 cores. Hawk has eight cores, and spinning rust. Hawk is
where the problem has shown up. Peregrine does not show the problem.

For 18 gig of data, hawk: 2m18.318s, peregrine 0m24.233s.

That clue makes me think the race condition is in the SCSI stack.

It looks like there is a newer kernel for Trixie. It is best to file
bug reports against current packages. Can you test it?

https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From The Wanderer@3:633/10 to All on Tuesday, June 23, 2026 04:20:01

On 2026-06-22 at 18:40, David Christensen wrote:

On 6/22/26 09:54, Charles Curley wrote:

I think that this problem first showed up late April or early May.
If that is correct, the following extracts from my apt logs might
help. These were copied and pasted unwrapped, so unless they were
mangled in transit, long lines should be preserved.

...

Apparently, twice in April I had tried the backports kernel(s),
and found them unsatisfactory. So possibly the fix came in between
linux-image-6.19.11+deb13-amd64 and the current backports kernel,
linux-image-7.0.10+deb13-amd64.

AIUI if someone can come up with a shell script or program whose exit
value reliably indicates the presence or absence of a bug, Git can
do a binary search over a range of commits and locate the commit
where the bug originated.

...assuming that there aren't any commits broken for other reasons, or
otherwise commits where the codebase can't be built far enough for the
script or program to be able to do its thing, that will be hit along the
way.

That could be folded in under "reliably", of course - but it's
sufficiently far out of the scope of what people might be expected to
think of for that term, if not previously familiar with what such
bisections can involve, that it seems worth calling out explicitly.

That said, git can actually do this even *without* such a
script/program, as long as you're willing and able to test each
candidate commit manually. The use of a script or program to automate it
is actually a subset of the functionality of the 'git bisect'
sub-command, specifically the sub-sub-command 'git bisect run'; see 'git
help bisect' for the documentation, most of which is about the version
of the process where the validation of each commit is done manually
rather than by a script.
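
As a minimal sketch of what such a test could look like for an intermittent fault: 'git bisect run' treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad. Because a single clean pass proves little here, the hypothetical helper below (bisect_verdict is a made-up name) repeats a check several times and maps the outcome onto those codes. For a kernel, each candidate also has to be built and booted first, so in practice the verdict would more likely be reported by hand with 'git bisect good' or 'git bisect bad'.

```shell
# bisect_verdict RUNS CMD [ARGS...]
# Run CMD up to RUNS times. Return 1 ("bad") on the first failing run,
# 0 ("good") if every run passes, and 125 ("skip") if CMD cannot be found.
# Hypothetical helper for illustration only.
bisect_verdict() {
    runs=$1
    shift
    command -v "$1" >/dev/null 2>&1 || return 125
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" || return 1
        i=$((i + 1))
    done
    return 0
}
```

The more repetitions, the lower the chance of marking a bad kernel as good, at the cost of a longer test per step.
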
--
The Wanderer
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Friday, June 26, 2026 00:50:01

On Mon, 22 Jun 2026 15:40:11 -0700
David Christensen <dpchrist@holgerdanske.com> wrote:

It looks like there is a newer kernel for Trixie. It is best to file
bug reports against current packages. Can you test it?

https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64

Tested. It came up with all sorts of fails.

charles@hawk:~$ uname -a
Linux hawk 6.12.94+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.94-1 (2026-06-20) x86_64 GNU/Linux
charles@hawk:~$

I gather you are suggesting I file a bug against this kernel, linux-image-6.12.94+deb13-amd64 6.12.94-1.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Friday, June 26, 2026 03:10:02

On 6/25/26 15:40, Charles Curley wrote:

On Mon, 22 Jun 2026 15:40:11 -0700
David Christensen <dpchrist@holgerdanske.com> wrote:

It looks like there is a newer kernel for Trixie. It is best to file
bug reports against current packages. Can you test it?

https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64

Tested. It came up with all sorts of fails.

charles@hawk:~$ uname -a
Linux hawk 6.12.94+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.94-1 (2026-06-20) x86_64 GNU/Linux
charles@hawk:~$

I gather you are suggesting I file a bug against this kernel, linux-image-6.12.94+deb13-amd64 6.12.94-1.

That would make sense, especially since Linux 6.12 appears to be the
kernel for Debian Stable (Trixie):

https://packages.debian.org/trixie/all/allpackages

David

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Friday, June 26, 2026 05:50:02

On Thu, 25 Jun 2026 18:07:21 -0700
David Christensen <dpchrist@holgerdanske.com> wrote:

That would make sense, especially since Linux 6.12 appears to be the
kernel for Debian Stable (Trixie):

https://packages.debian.org/trixie/all/allpackages

OK, will do.

However, things are back up in the air. I rebooted to 7.0.10, and ran
some backups. The prior testing has been all reading: checksum
verification. Nothing on the disk actually changed. This backs up from
the SSD to the RAID array. I had several directories fail with the error
message "failed: Bad message (74)", whatever that means.

I immediately fscked the logical volume. No errors. I then diffed the
directories against the originals. There were four instances of files
found in the backups and not in the originals. Otherwise the "failed"
directories were intact and duplicated the originals. I conjecture that
rsync attempted to delete them and failed to do so. I conjecture that
the next pass will get them.
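
The comparison described above can be sketched as follows. This is an assumption about the approach, not the commands actually used: only_in_backup is a hypothetical helper that compares file names only (not contents) and assumes no newlines in file names.

```shell
# only_in_backup ORIG BACKUP: list regular files that exist under BACKUP
# but have no counterpart at the same relative path under ORIG.
# Hypothetical helper for illustration only.
only_in_backup() {
    orig_dir=$1
    backup_dir=$2
    list_orig=$(mktemp) || return 2
    list_backup=$(mktemp) || { rm -f -- "$list_orig"; return 2; }
    ( cd -- "$orig_dir" && find . -type f | LC_ALL=C sort ) > "$list_orig"
    ( cd -- "$backup_dir" && find . -type f | LC_ALL=C sort ) > "$list_backup"
    # comm -13 drops lines unique to the first list and lines common to
    # both, leaving only the paths found in the backup alone.
    LC_ALL=C comm -13 "$list_orig" "$list_backup"
    rm -f -- "$list_orig" "$list_backup"
}
```

An empty result after the next backup pass would be consistent with the conjecture that the leftover files simply had not been deleted yet.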

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Saturday, June 27, 2026 01:50:02

On 6/25/26 20:44, Charles Curley wrote:

On Thu, 25 Jun 2026 18:07:21 -0700 David Christensen wrote:

That would make sense, especially since Linux 6.12 appears to be the
kernel for Debian Stable (Trixie):

https://packages.debian.org/trixie/all/allpackages

OK, will do.

Thank you.

However, things are back up in the air. I rebooted to 7.0.10, and ran
some backups. The prior testing has been all reading: checksum
verification. Nothing on the disk actually changed. This backs up from
the SSD to the RAID array. I had several directories fail with the error
message "failed: Bad message (74)", whatever that means.

I immediately fscked the logical volume. No errors. I then diffed the
directories against the originals. There were four instances of files
found in the backups and not in the originals. Otherwise the "failed"
directories were intact and duplicated the originals. I conjecture that
rsync attempted to delete them and failed to do so. I conjecture that
the next pass will get them.

Are you booting the OS disk and running backups, or are you booting live
media and backing up?

Please post a console session that shows prompts, commands entered, and
output displayed. If you are using shell scripts, please enable the
"xtrace" option.

David

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
