I have four 4 TB hard drives. Each has a partition on it. The
four partitions comprise a RAID 5 array using mdadm. On top of that,
LUKS encryption, then LVM with ext4 logical volumes.
On one logical volume I have a number of backup files, tarred,
bzipped, and sha256 and sha512 summed. I have a script which will find
checksum files, and execute the appropriate program to test the
archives. It puts each program into the background, parallelizing any
number of checksum tests.
Starting about a week ago, the script finds an error in one or more
files out of several. Results are inconsistent: one pass may find an
error in a given file, the next pass may not find any errors in it.
Running checksums manually, one at a time, does not turn up an error.
Running "tar tvf" finds no error in a suspect file. Running "bunzip2
-t" also turns up no error. Only running the script turns up any
errors.
I create two checksum files when I create the backups, for sha256 and
sha512. After this problem surfaced (about a week ago), I then made two
new checksum files of a suspect file. The two checksum file pairs
(e.g. both sha512sum files) show the same checksums. The script now
tests using both the old and new checksum files. Sometimes only one pair
of checksum files fails the suspect file.
In addition to all of that, I also get the occasional "bad message"
error. I have no idea what that means, but an fsck seems to deal with
it.
To be thorough, I have run extended SMART tests on the hard drives,
kicked mdadm into testing the RAID array, and fscked the LVM partitions
on the RAID array. Only fsck turned up issues, and that has not stopped.
I also back some of this up to offsite USB drives. I ran the script on
one of those, using a different computer. No errors reported.
I have a hypothesis as to what is going on, but would like to hear from
you before I discuss it.
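
For anyone trying to reproduce the symptom, here is a minimal sketch of a parallel verifier of the kind described above. It is not the poster's script: the function name verify_all, the *.sha256 / *.sha512 file naming, and the assumption that each checksum file sits in the same directory as the archive it covers are all illustrative guesses.

```shell
#!/bin/sh
# Sketch only (hypothetical helper, not the original script).
# Runs one background job per checksum file found under DIR and
# prints a "FAILED: <checksum file>" line for each one that does not verify.
verify_all() {
    dir=$1
    log=$(mktemp) || return 2
    # find(1) prints one path per line; file names containing newlines
    # are not handled by this sketch.
    find "$dir" -type f \( -name '*.sha256' -o -name '*.sha512' \) | {
        while IFS= read -r sumfile; do
            case $sumfile in
                *.sha256) tool=sha256sum ;;
                *.sha512) tool=sha512sum ;;
            esac
            {
                # Checksum files normally hold relative names, so verify
                # from inside the directory that contains them.
                ( cd "$(dirname "$sumfile")" &&
                  "$tool" -c "$(basename "$sumfile")" >/dev/null 2>&1 ) ||
                    printf 'FAILED: %s\n' "$sumfile" >>"$log"
            } &
        done
        # Wait here, inside the pipeline's subshell, because that is
        # where the background jobs were started.
        wait
    }
    status=0
    if [ -s "$log" ]; then
        sort "$log"
        status=1
    fi
    rm -f "$log"
    return "$status"
}
```

Calling verify_all on the backup directory several times in a row, and then once more with the trailing & removed so the jobs run one at a time, would show whether the failures only appear under parallel load.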
--
Does anybody read signatures any more?
https://charlescurley.com
https://charlescurley.com/blog/
Just for clarity: where (from what source), and when (at what point), is
I had an issue some months back. It turned out to be a bad RAM stick
in my NAS. The issues would not show up on a restart but after some
usage it would hit the RAM errors and :(
On Fri, 22 May 2026 10:05:56 -0600
Andrew Latham <lathama@gmail.com> wrote:
> I had an issue some months back. It turned out to be a bad RAM stick
> in my NAS. The issues would not show up on a restart but after some
> usage it would hit the RAM errors and :(
This is not impossible. I recently had some RAM go bad, failing
memtest. I have replaced it with new RAM, which does not fail
memtest. Maybe I should let it run for several passes.
Thanks.
Yes, I should have added that this RAM was only failing when warm/hot
which was not fun to discover.
The very first thing that came to my mind out of that was RAM issues.
Disk issues were the second, but the tests you've run there seem as if
they'd probably have ruled that out.
If you run a script to generate the hash of a given file in a loop
(possibly with a don't-overload-the-system pause in between if you
prefer), does it always show the same hash, or does it sometimes show
a different one?
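
A minimal sketch of such a loop follows. The function name hash_loop and its defaults (10 passes, no pause) are assumptions made for illustration. Note that repeated reads of the same file may be served from the page cache rather than from the disks, so a stable hash here does not by itself clear the storage stack.

```shell
#!/bin/sh
# hash_loop FILE [PASSES] [PAUSE_SECONDS]
# Hypothetical helper: hashes FILE repeatedly with sha256sum and reports
# every pass whose hash differs from the first pass.
hash_loop() {
    file=$1
    passes=${2:-10}
    pause=${3:-0}
    first=
    mismatches=0
    i=1
    while [ "$i" -le "$passes" ]; do
        sum=$(sha256sum "$file" | cut -d ' ' -f 1)
        # Remember the hash from the first pass as the reference.
        [ -n "$first" ] || first=$sum
        if [ "$sum" != "$first" ]; then
            mismatches=$((mismatches + 1))
            printf 'pass %d: %s (differs from pass 1)\n' "$i" "$sum"
        fi
        i=$((i + 1))
        sleep "$pause"   # the don't-overload-the-system pause
    done
    printf '%d of %d passes differed from the first hash %s\n' \
        "$mismatches" "$passes" "$first"
    # Exit status 0 only if every pass agreed.
    [ "$mismatches" -eq 0 ]
}
```

For example, hash_loop suspect.tar.bz2 50 5 would hash the file 50 times with a five-second pause between passes; the file name is a placeholder.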
I remarked to a local computer repair shop that I have a four TB
backup drive. He said "replace it. Four TB isn't ready yet."
May I ask, was this ECC RAM?
My personal NAS contains 16 sticks of 16GB DDR3 registered RAM. It has
been logging a CE (correctable error) memory scrubbing error once or
twice a day for 700+ days.
It is always the same page/address, triggering a soft offline of that
memory page.
journalctl | grep -i RAM
On 5/22/26 3:02 PM, Charles Curley wrote:
> journalctl | grep -i RAM
Sure enough, that gets me a boatload of RAM error reports on my
server. On my desktop without ECC it does not. I think no noise = good;
however, I have rasdaemon installed on the server, and I think it may
take a combination of that + ECC to make the RAM errors log. It's been
a while since I set this up. I think I had to change a setting in the
Dell BIOS to prevent its log from eating the error instead of handing
it to the OS.
I was simply reading sudo dmesg.
If I'm correct, memtest86 is nearly useless on ECC RAM.
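
A case-insensitive grep for RAM also matches unrelated words such as "initramfs" or "framebuffer", so a narrower filter may be easier to read. The sketch below is one possible filter; the function name mem_error_lines and the pattern list are assumptions about how EDAC and machine-check reports are usually worded, and the wording can differ between kernel versions and platforms.

```shell
#!/bin/sh
# Hypothetical filter: keeps only lines from stdin that look like
# memory error reports (EDAC, machine checks, page soft-offlining).
mem_error_lines() {
    grep -Ei 'edac|mce:|machine check|hardware error|memory (scrubbing|failure)|soft[ -]offlin'
}

# Typical use against the kernel messages in the journal:
#   journalctl -k --no-pager | mem_error_lines
```

An empty result on a machine without ECC does not show the RAM is good; it may only mean nothing is in a position to detect and report the errors.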
> If I'm correct, memtest86 is nearly useless on ECC RAM.
Maybe.
"MemTest86 directly polls ECC errors logged in the chipset/memory
controller registers and displays it to the user on-screen. In
addition, ECC errors are written to the log and report file.
"During testing, MemTest86 may report ECC errors detected by the
memory controller if ECC is supported and enabled. This is
demonstrated in the following screenshot:"
https://www.memtest86.com/ecc.htm#memtest86
On 5/22/26 1:32 PM, Van Snyder wrote:
> I remarked to a local computer repair shop that I have a four TB
> backup drive. He said "replace it. Four TB isn't ready yet."
How so? I thought 4TB is showing its age...
I'm running 12x 4TB drives. Used SAS drives. Accumulated power-on time
ranges from 40,166 to 73,439 hours.
Smartctl informs me device /dev/sdf is worsening with increased read
errors over time. That one shows 73408 hours powered up, 72360.67 GB
read, 119545.193 GB written, 195 power cycles (13 since July 13 2024).
Defect list increased from 3 to 6872.
I see two other drives have defect lists of 23 and 14, respectively.
All others are at 0. Considering that, I should probably prioritize
replacing at least sdf soon to avoid losing redundancy during resilver,
considering the age of the pool.
nwe@srv01:~$ zpool status -c vendor,model,size
  pool: POOL1
 state: ONLINE
  scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10 04:34:06 2026
config:
        NAME        STATE     READ WRITE CKSUM   vendor         model  size
        POOL1       ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdc     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
            sdd     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
            sde     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdf     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdh     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
            sdg     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
            sdi     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdj     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdk     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
            sdl     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
            sdm     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
errors: No known data errors
On 5/22/26 12:43, nwe wrote:
> On 5/22/26 1:32 PM, Van Snyder wrote:
>> I remarked to a local computer repair shop that I have a four TB
>> backup drive. He said "replace it. Four TB isn't ready yet."
> How so? I thought 4TB is showing its age...
+1  I am also curious why 4 TB HDD's are not "ready yet".
The repair guy who made that remark to me didn't explain why he
I am still trying to understand the smartctl(8) "SMART Attributes Data
Structure".  The RAW_VALUE seems to be a binary bit field (?) for
several attributes and is useless without manufacturer engineering
data.  The VALUE column is supposed to be a percentage that starts at
100% and goes down to 0% as the disk wears out:
* Raw_Read_Error_Rate, Seek_Error_Rate, and Hardware_ECC_Recovered can
  have low VALUE numbers, but the disk seems to keep working.
* Low VALUE numbers for Reallocated_Sector_Ct, Current_Pending_Sector,
  and/or Offline_Uncorrectable seem to be reliable indicators of a
  failing disk.
* I have not seen a VALUE number other than 100% for End-to-End_Error.
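
The three attributes in the second bullet can be pulled out of the attribute table with a small filter. This is a sketch under assumptions: the function name flag_sector_attrs is made up, it expects the ten-column ATA table that smartctl -A prints (RAW_VALUE in the last column), and it treats any non-zero raw count for these three attributes as worth a closer look, which is a rule of thumb rather than a vendor specification. SAS drives report through a different log format and would not match.

```shell
#!/bin/sh
# Hypothetical filter for `smartctl -A` output on ATA drives.
# Column 2 is ATTRIBUTE_NAME, 4 is VALUE, 6 is THRESH, 10 is RAW_VALUE.
flag_sector_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # For these attributes the raw value is usually a plain count.
            status = ($10 + 0 > 0) ? "CHECK" : "ok"
            printf "%-24s value=%s thresh=%s raw=%s %s\n", $2, $4, $6, $10, status
        }
    '
}

# Typical use (needs root):
#   smartctl -A /dev/sdX | flag_sector_attrs
```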
Twelve disks gives you many choices for how to lay out the pool and
trade off redundancy vs. capacity vs. performance.  Is the data
balanced across disks?  Does the machine have enough memory?  Is the
ARC working well?
On two of my earlier pools, I added a 60 GB SSD as a cache vdev after
the pools had data.  I did not notice any improvement.
On one of my earlier pools of one mirror of two 3 TB HDD's that was
nearly full, I added another mirror of two 3 TB HDD's.  I did not
notice any improvement.
I rebuilt the storage pool with two mirrors of two 3 TB HDD's each and
a special vdev mirror of two 180 GB SSD's.  I also set
special_small_blocks=16K.  I then restored the data via replication.
The data is now balanced across disks, latency has dropped, throughput
has increased, and overall performance is noticeably better:
2026-05-22 15:12:45 toor@f5 ~
# freebsd-version
13.5-RELEASE-p12
2026-05-22 15:19:47 toor@f5 ~
# zpool iostat -v p5
                    capacity     operations     bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
p5                3.76T  1.82T      6      1  3.68M  32.2K
  mirror-0        1.87T   871G      2      0  1.82M  4.48K
    gpt/hdd0.eli      -      -      1      0   931K  2.24K
    gpt/hdd1.eli      -      -      1      0   931K  2.24K
  mirror-1        1.86T   876G      2      0  1.81M  4.35K
    gpt/hdd2.eli      -      -      1      0   928K  2.18K
    gpt/hdd3.eli      -      -      1      0   928K  2.18K
special               -      -      -      -      -      -
  mirror-2        31.1G   118G      1      1  51.2K  23.3K
    gpt/ssd0.eli      -      -      0      0  25.6K  11.7K
    gpt/ssd1.eli      -      -      0      0  25.6K  11.7K
----------------  -----  -----  -----  -----  -----  -----
2026-05-22 15:32:42 toor@f5 ~
# top -d 1 | head -n 7
last pid: 57622;  load averages:  0.24,  0.21,  0.17  up 24+22:47:05  15:32:45
27 processes:  1 running, 26 sleeping
CPU:  0.0% user,  0.0% nice,  0.6% system,  0.0% interrupt, 99.4% idle
Mem: 4848K Active, 330M Inact, 856K Laundry, 14G Wired, 920M Buf, 694M Free
ARC: 12G Total, 10G MFU, 485M MRU, 3328K Anon, 200M Header, 899M Other
     9921M Compressed, 33G Uncompressed, 3.36:1 Ratio
Swap: 764M Total, 764M Free
2026-05-22 15:33:12 toor@f5 ~
# arc_summary | grep -A 5 "ARC total accesses"
ARC total accesses (hits + misses):                          512.7M
        Cache hit ratio:                        99.8 %       511.8M
        Cache miss ratio:                        0.2 %       886.5k
        Actual hit ratio (MFU + MRU hits):      99.3 %       509.2M
        Data demand efficiency:                 99.5 %         4.8M
        Data prefetch efficiency:               19.2 %        96.9k
In hindsight:
1.  I gathered file system statistics prior to rebuilding the pool and
setting special_small_blocks=16K, but it now appears I could have used
a larger value.
2.  If I get worried about HDD's failing, I can add disks to the pool
as spares and/or add disks to the data mirrors.  The latter should
increase read performance even more.
3.  My ~10 year old HDD's can already saturate Gigabit with sequential
I/O.  RAID 10 with SSD acceleration is even more overkill.  I want 10
GbE.
David
On 5/22/26 6:25 PM, David Christensen wrote:
> I am still trying to understand the smartctl(8) "SMART Attributes Data
> Structure".  The RAW_VALUE seems to be a binary bit field (?)
Same here. Some makes/models of hardware seem to produce a greater
quantity of comprehensible SMART data.
Running something like
# smartctl -x /dev/sdf
returns additional data.
> Twelve disks gives you many choices for how to layout the pool and
> trade-off redundancy vs. capacity vs. performance.  Is the data
> balanced across disks?  Does the machine have enough memory?  Is the
> ARC working well?
It has 256GB RAM.
My desktop PC currently has only 1Gb networking ever since I replaced
my fiber optic card with a GPU in the lone PCIe slot. During the time I
had 10g networking direct from server to workstation, I recall easily
saturating the network. Amazing speed, but I needed the GPU more. 10g
networking is faster than the cheap SSDs in most of my PCs.
If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
and 10 GbE (copper RJ-45) Ethernet adapters.  If you have a
Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
single and dual Ethernet adapters.  Be sure to verify Debian and Linux
driver support with the manufacturer before purchasing:
https://www.amazon.com/s?k=thunderbolt+sfp+ethernet
https://www.sonnettech.com/product/twin25g/overview.html
I have been using Intel SSD 520 Series 2.5" SATA III drives for many
years.
How do you back up a 36 TB pool?
On 5/23/26 12:44 AM, David Christensen wrote:
> If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
> and 10 GbE (copper RJ-45) Ethernet adapters.  If you have a
> Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
> single and dual Ethernet adapters.
I've looked at those, been thinking of trying it some time. The only
time I really wish for faster than 1g networking straight to my
workstation is when I'm cloning a complete hard drive to a backup image
file in the pool.
The only glitch I've run into so far is I've got to match the correct
optics to the cards, speaking of vendor lock-in. Intel cards want Intel
optics. Most other brands seem to accept most other brands' optics. No
experience yet with Cisco brand. Dell can come as Intel or other.
Cheap managed network switches from AliExpress with two 10g SFP+ ports
plus four 1g RJ45: I guess I got what I paid for. They seem to mostly
work; I had a few random failures along the way. I've read scare stories
about these potentially dialing home etc. I have not confirmed such. I
suspect these can't, as I have them configured on an isolated VLAN.
One wonderful network switch deal I find common on eBay: the OS6450-P48.
It is cheap, with 48 PoE 1 gig ports plus two 10g SFP ports that don't
seem real picky about what brand optics I use. It is a bit technical to
configure.
> I have been using Intel SSD 520 Series 2.5" SATA III drives for many
> years.
Same here, just discovered them cheap on eBay a few years ago. Some show
up with SMART reports indicating wear that would trigger me replacing a
consumer-grade SSD. I've been using those nearly daily for over 2 years
with no failures yet.
> How do you back up a 36 TB pool?
(blush)
I know my current backup scheme could someday bite:
1. I have a full rsync backup on a spare R710 server in another
building. That server probably hasn't been powered up in 18 months. All
I need to do is go plug it into an outlet then spend a couple minutes in
an ssh session from the comfort of my office chair.
2. The most critical files get regularly backed up (manually) to a
remote storage VPS.
I compress the files into an encrypted archive then upload using scp.
3. Some random memory sticks and hard drives...
So I've got some coverage in case of disaster but it could use
improvement.
I know automatic backups could be nice but I just don't trust such.
I also have a clone of the server's boot drive.
nwe
On 5/22/26 13:22, nwe wrote:
> On 5/22/26 3:02 PM, Charles Curley wrote:
> If I'm correct, memtest86 is nearly useless on ECC RAM.
MemTest86 v11.7 Free Edition claims to support ECC:
https://www.memtest86.com/compare.html
AFAICT memtest86+ does not support ECC. Some people suggest disabling ECC
in BIOS/UEFI Setup and then testing with memtest86+.
Hi,
The only machines I have ECC RAM in also have Machine Check Exception
messages go to a log in the firmware. I have experienced running
memtest86 for a few successful complete passes and then finding messages
in the firmware log about things that were corrected, which enabled me
to locate the bad stick.
Also, most of the time I've had RAM fail it's done so in a way that ECC
can't fix, because ECC can only correct a single bit flip.
So, I have continued to find memtest86 useful although it would be a bit
less so if I had no way to see the MCE log.
On Fri, 22 May 2026 10:05:56 -0600
Andrew Latham <lathama@gmail.com> wrote:
> I had an issue some months back. It turned out to be a bad RAM stick
> in my NAS. The issues would not show up on a restart but after some
> usage it would hit the RAM errors and :(
> This may not be your issue, but I remember how annoying it was to
> figure out.
Thanks. I tried testing for this. I recently had one of two RAM sticks
go bad. I bought replacements in April and installed them. To test for
one of the new sticks being bad, I put the one good stick in instead of
the two new ones. The problem shows up. So I'm not sure whether I have
bad RAM or not.
On Fri, 22 May 2026 09:53:17 -0600
Charles Curley <charlescurley@charlescurley.com> wrote:
> To be thorough, I have run extended SMART tests on the hard drives,
> kicked mdadm into testing the RAID array, and fscked the LVM
> partitions on the RAID array. Only fsck turned up issues, and that
> has not stopped.
Some additional testing.
Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:
"Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
" When the command that caused the error occurred, the device was active or idle.",
"",
" After command completion occurred, registers were:",
" ER -- ST COUNT LBA_48 LH LM LL DV DC",
" -- -- -- == -- == == == -- -- -- -- --",
" 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
"",
" Commands leading to the command that caused the error were:",
" CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
" -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
" 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
" b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
" b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
" ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
"",
"SMART Extended Self-test Log Version: 1 (1 sectors)",
"Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
"# 1 Extended offline Completed without error 00% 6756 -",
"# 2 Extended offline Completed without error 00% 6573 -",
"# 3 Extended offline Completed without error 00% 102 -",
"# 4 Short offline Completed without error 00% 96 -",
"",
So I did the obvious: I failed and removed the drive from the array.
The problem still showed up, but with fewer failures in the same data
set.
I have since added the drive back to the array, and am testing the
array now.
mdadm --monitor --test --oneshot /dev/md0
I begin to wonder if I have a bad motherboard.
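
As far as I understand mdadm(8), --monitor --test --oneshot checks the array state once and sends a test alert; it does not read and compare the stripes. A full consistency check is normally requested through the md sysfs interface. The sketch below assumes the array is md0 and uses two made-up helper names, md_start_check and md_scrub_report; it needs root on a real array, and a check cannot start while a recovery (for example after re-adding a member) is still running.

```shell
#!/bin/sh
# Hypothetical helpers around the md sysfs files sync_action and
# mismatch_cnt. The directory defaults to /sys/block/md0/md (assumed
# array name) and can be overridden by the first argument.

# Ask the kernel to read every stripe and compare data against parity.
md_start_check() {
    echo check > "${1:-/sys/block/md0/md}/sync_action"
}

# Show what the array is doing and how many mismatches were counted
# by the most recent check.
md_scrub_report() {
    mddir=${1:-/sys/block/md0/md}
    printf 'sync_action: %s\n' "$(cat "$mddir/sync_action")"
    printf 'mismatch_cnt: %s\n' "$(cat "$mddir/mismatch_cnt")"
}
```

Progress of a running check also shows up in /proc/mdstat. A non-zero mismatch_cnt on RAID 5 says that data and parity disagree somewhere, but not which member is at fault.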
I hate to suggest other tangents but re-seat all the connectors and
maybe a power supply test. A brown-out of power would cause issues
like this.
On Mon, 1 Jun 2026 15:23:02 -0600
Andrew Latham <lathama@gmail.com> wrote:
> I hate to suggest other tangents but re-seat all the connectors and
> maybe a power supply test. A brown-out of power would cause issues
> like this.
I did reseat all the data and power cables for the hard drives, the SSD
system drive and the CD, etc. drive. I doubt I've had a brown-out, as
the computer in question is on a UPS which has plenty of spare battery.
As for the power supply test, not having a power supply tester, I have
not done that.
I've seen too many hard-to-diagnose problems that disappeared after
replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
video noise, hang on reboot, etc. I no longer try to go cheap when
choosing a PSU for builds.
> I've seen too many hard-to-diagnose problems that disappeared after
> replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
> video noise, hang on reboot, etc. I no longer try to go cheap when
> choosing a PSU for builds.
Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
while it's now standard for chips to monitor their temperature (and
adjust their power consumption if it gets too high), I still haven't
seen anything comparable that would detect and report when the input
voltage goes out-of-range (and maybe also take steps to reduce the
instantaneous power consumption?).
Instead, we're in the dark, forced to try to avoid the problem
by over-provisioning the power supply and praying that it was enough.
=== Stefan
On 6/2/26 07:02, Stefan Monnier wrote:
>> I've seen too many hard-to-diagnose problems that disappeared after
>> replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
>> video noise, hang on reboot, etc. I no longer try to go cheap when
>> choosing a PSU for builds.
> Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
> while it's now standard for chips to monitor their temperature (and
> adjust their power consumption if it gets too high), I still haven't
> seen anything comparable that would detect and report when the input
> voltage goes out-of-range (and maybe also take steps to reduce the
> instantaneous power consumption?).
https://en.wikipedia.org/wiki/Lm_sensors
lm_sensors (Linux-monitoring sensors) is a free open-source software-tool
for Linux that provides tools and drivers for monitoring temperatures,
voltage, humidity, and fans. It can also detect chassis intrusions.
David Christensen wrote:
lm_sensors (Linux-monitoring sensors) is a free open-source
software-tool for Linux that provides tools and drivers for
monitoring temperatures, voltage, humidity, and fans. It can also
detect chassis intrusions.
Almost all of those sensors are on the motherboard or on
attached cards; you'll frequently get to see voltages from the
CPU, but hardly ever does a power supply tell you about its
load.
They are also going to be polled, and will return the voltage they find
at polling time. What they won't tell you is what various ripple
voltages are, both the initial rectified mains at 100/120Hz and the
residual switching frequencies of the various step-down regulators.
On 6/3/26 10:36, Stefan Monnier wrote:
Yeah, I think we'd need a kind of sensor that doesn't give just the
current voltage but gives a bracket of the lowest & highest voltage seen
since the last measurement, or one that can trigger an interrupt if the
voltage ever goes outside of a given range.
I want something that does a Fourier transform of the voltage data so I
can see if there's ripple at any particular frequency. A live waterfall
plot would be wonderful.
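The "lowest & highest voltage seen" idea above can be roughly approximated in software, with the limitation already noted in the thread: a poll only sees the value at polling time, so ripple and short dips between samples stay invisible. The following is a minimal sketch, not something any poster ran. The function name voltage_bracket is made up for illustration, and the hwmon path in the usage note is an assumption that varies by board and driver.

```sh
# voltage_bracket FILE SAMPLES [INTERVAL]
# Polls FILE SAMPLES times, INTERVAL seconds apart (default 1), and prints
# "min max". FILE is expected to yield one integer per read, as the hwmon
# in*_input files do (millivolts). Anything that happens between two polls
# is missed, so this narrows the blind spot but does not remove it.
voltage_bracket() {
    file=$1
    samples=$2
    interval=${3:-1}
    min=
    max=
    n=0
    while [ "$n" -lt "$samples" ]; do
        read -r value < "$file" || return 1
        if [ -z "$min" ] || [ "$value" -lt "$min" ]; then min=$value; fi
        if [ -z "$max" ] || [ "$value" -gt "$max" ]; then max=$value; fi
        n=$((n + 1))
        if [ "$n" -lt "$samples" ]; then sleep "$interval"; fi
    done
    echo "$min $max"
}
```

Possible use, assuming the board exposes a voltage input at this (hypothetical) path: voltage_bracket /sys/class/hwmon/hwmon0/in0_input 600 1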
On 6/2/26 15:47, Joe wrote:
Einstein allegedly said:
"Insanity is doing the same thing over and over again and
expecting different results"
He had obviously never encountered the Intermittent Fault.
I may be a little slow upstairs, but how does this differ from
persistence??????? Just asking! :-)
I believe I have found a solution to this problem. I installed the
backports kernel. Since then I have run more than four hours solid of
tests and not found a single error.
I did replace one hard drive. While that resulted in a quieter office,
it did not solve the problem.
Checking voltages from the power supply and the wall with a digital
volt meter did not show any out of spec problems.
I am glad that your storage is working correctly now.
Please run and post the following commands with both the previous
kernel and the backports kernel:
$ cat /etc/debian_version
$ uname -a
I will assume your script spawns a separate, isolated process for
each checksum file.
If you have ruled out the power supply, memory, and disks, another
possibility could be a race condition in the kernel and/or I/O stack
that is triggered when multiple processes access storage in parallel.
Are the checksum errors repeatable on another computer with a similar
storage architecture and the previous kernel? If so, do they
disappear with the backports kernel?
David
On Mon, 22 Jun 2026 00:32:19 -0700 David Christensen wrote:
Please run and post the following commands with both the previous
kernel and the backports kernel:
$ cat /etc/debian_version
$ uname -a
Backport
root@hawk:~# cat /etc/debian_version
13.5
root@hawk:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.5
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@hawk:~# uname -s
Linux
root@hawk:~# uname -a
Linux hawk 7.0.10+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 7.0.10-1~bpo13+1 (2026-05-28) x86_64 GNU/Linux
root@hawk:~#
Prior kernel is the same except the kernel is
linux-image-6.12.90+deb13.1-amd64 6.12.90-2 amd64
I think that this problem first showed up late April or early May. If
that is correct, the following extracts from my apt logs might help.
These were copied and pasted unwrapped, so unless they were mangled in transit, long lines should be preserved.
...
Apparently, twice in April I had tried the backports kernel(s), and
found them unsatisfactory. So possibly the fix came in between
linux-image-6.19.11+deb13-amd64 and the current backports kernel,
linux-image-7.0.10+deb13-amd64.
Quite possibly a major version number change
might even inadvertently fix something as subtle as this.
I will assume your script spawns a separate, isolated process for
each checksum file.
Correct. It does a find on *.sha256sums, *.sha512sums, and several other
suffixes. It then slices and dices to figure out the appropriate
program to call, sha256sum and sha512sum, respectively.
The files are created using paths relative to the current directory, so
when the script runs, it will pushd to that directory.
The key line is
nice "${prog}" "${opts}" -c "${file}" &
Where $prog is the result of the slicing and dicing, opts='--quiet',
and $file the file to be scanned.
I recently added the option to limit the number of background tasks to
the number of processors (nproc --all). That reduces the number of
errors but does not eliminate them.
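For readers who want to reproduce the access pattern, here is a minimal sketch of a bounded parallel verifier. It is not the script described above. It swaps the "&" plus job-limit approach for xargs -P, which caps the number of concurrent verifiers in one place, and it handles only the two suffixes named in this thread. The function name verify_all and the "FAIL" output format are made up for illustration.

```sh
# verify_all DIR [MAX_JOBS]
# Finds *.sha256sums and *.sha512sums files under DIR and verifies each one
# from its own directory (the checksum files hold relative paths), running
# at most MAX_JOBS verifiers at once (default: nproc --all).
# Prints "FAIL <checksum file>" for each failed verification and returns
# non-zero if any verification failed.
verify_all() {
    dir=$1
    max_jobs=${2:-$(nproc --all)}
    find "$dir" -type f \( -name '*.sha256sums' -o -name '*.sha512sums' \) -print0 |
        xargs -0 -r -n 1 -P "$max_jobs" sh -c '
            file=$1
            case "$file" in
                *.sha256sums) prog=sha256sum ;;
                *.sha512sums) prog=sha512sum ;;
            esac
            cd "$(dirname "$file")" || exit 1
            if ! nice "$prog" --quiet -c "$(basename "$file")" >/dev/null 2>&1; then
                echo "FAIL $file"
                exit 1
            fi
        ' sh
}
```

Running verify_all /path/to/backups 1 and then again with the default job count gives a direct comparison between serial and parallel passes over the same files, which is the distinction that matters for the fault described in this thread.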
If you have ruled out the power supply, memory, and disks, another
possibility could be a race condition in the kernel and/or I/O stack
that is triggered when multiple processes access storage in parallel.
I wouldn't say those are all ruled out. But the fact that the
backport kernel is a major version number change, and that it appears
to have solved the problem is highly suggestive. With that caveat, I
concur. That is definitely something for the kernel folks to look at.
Are the checksum errors repeatable on another computer with a similar
storage architecture and the previous kernel? If so, do they
disappear with the backports kernel?
I have a much more recent and much faster computer, peregrine, with
NVMe storage and 12 cores. Hawk has eight cores, and spinning rust.
Hawk is where the problem has shown up. Peregrine does not show the
problem.
For 18 gig of data, hawk: 2m18.318s, peregrine: 0m24.233s.
On 6/22/26 09:54, Charles Curley wrote:
I think that this problem first showed up late April or early May.
If that is correct, the following extracts from my apt logs might
help. These were copied and pasted unwrapped, so unless they were
mangled in transit, long lines should be preserved.
...
Apparently, twice in April I had tried the backports kernel(s),
and found them unsatisfactory. So possibly the fix came in between
linux-image-6.19.11+deb13-amd64 and the current backports kernel,
linux-image-7.0.10+deb13-amd64.
AIUI if someone can come up with a shell script or program whose exit
value reliably indicates the presence or absence of a bug, Git can
do a binary search over a range of commits and locate the commit
where the bug originated.
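A test predicate of the kind described above might look like the following sketch. It assumes the fault shows up as a failed sha256sum verification under parallel reads, and it repeats the pass several times because the fault is intermittent. The function name bisect_verdict, the default of 5 passes, and the fixed 4 parallel jobs are all illustrative choices. The return values follow the convention of "git bisect run": 0 marks a commit good, 1 marks it bad, and 125 means the commit cannot be tested and should be skipped.

```sh
# bisect_verdict DIR [PASSES]
# Returns 0 when every pass verifies cleanly, 1 when any pass reports a
# checksum failure, and 125 when DIR holds no *.sha256sums files, so that
# nothing could be tested.
bisect_verdict() {
    dir=$1
    passes=${2:-5}
    count=$(find "$dir" -type f -name '*.sha256sums' | wc -l)
    [ "$count" -gt 0 ] || return 125

    pass=1
    while [ "$pass" -le "$passes" ]; do
        # xargs exits non-zero if any verifier fails.
        find "$dir" -type f -name '*.sha256sums' -print0 |
            xargs -0 -n 1 -P 4 sh -c '
                cd "$(dirname "$1")" &&
                sha256sum --quiet -c "$(basename "$1")" >/dev/null 2>&1
            ' sh || return 1
        pass=$((pass + 1))
    done
    return 0
}
```

For a kernel, each candidate commit has to be built and booted before the predicate can run, so a fully automatic "git bisect run" is unlikely to work as-is. One would more likely boot each candidate, run the predicate by hand, and report the result with "git bisect good", "git bisect bad", or "git bisect skip". A clean result from an intermittent fault is also weaker evidence than a failure, so a higher pass count is the safer choice.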
It looks like there is a newer kernel for Trixie. It is best to file
bug reports against current packages. Can you test it?
https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64
On Mon, 22 Jun 2026 15:40:11 -0700
David Christensen <dpchrist@holgerdanske.com> wrote:
It looks like there is a newer kernel for Trixie. It is best to file
bug reports against current packages. Can you test it?
https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64
Tested. It came up with all sorts of fails.
charles@hawk:~$ uname -a
Linux hawk 6.12.94+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.94-1 (2026-06-20) x86_64 GNU/Linux
charles@hawk:~$
I gather you are suggesting I file a bug against this kernel, linux-image-6.12.94+deb13-amd64 6.12.94-1.
That would make sense, especially since Linux 6.12 appears to be the
kernel for Debian Stable (Trixie):
https://packages.debian.org/trixie/all/allpackages
On Thu, 25 Jun 2026 18:07:21 -0700 David Christensen wrote:
That would make sense, especially since Linux 6.12 appears to be the
kernel for Debian Stable (Trixie):
https://packages.debian.org/trixie/all/allpackages
OK, will do.
However, things are back up in the air. I rebooted to 7.0.10, and ran
some backups. The prior testing has been all reading: checksum
verification. Nothing on the disk actually changed. This backs up from
the SSD to the RAID array. I had several directories fail with the error
message "failed: Bad message (74)", whatever that means.
I immediately fscked the logical volume. No errors. I then diffed the
directories against the originals. There were four instances of files
found in the backups and not in the originals. Otherwise the "failed"
directories were intact and duplicated the originals. I conjecture that
rsync attempted to delete them and failed to do so, and that the next
pass will get them.