• Schrödinger's hash

    From Charles Curley@3:633/10 to All on Friday, May 22, 2026 18:00:01
    I have four four terabyte hard drives. Each has a partition on it. The
    four partitions comprise a RAID 5 array using mdadm. On top of that,
    LUKS encryption, then LVM with ext4 logical volumes.

    On one LVM partition I have a number of backup files, tarred,
    bzipped, and sha256 and sha512 summed. I have a script which will find
    checksum files, and execute the appropriate program to test the
    archives. It puts each program into the background, parallelizing any
    number of checksum tests.
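A minimal sketch of the kind of script described above. It is not the poster's script: the file naming (a `*.sha256` or `*.sha512` file next to each archive) and the function name `verify_tree` are assumptions made for illustration. It starts every check in the background and then collects each job's exit status separately, so a failure is attributed to the checksum file that produced it.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes GNU coreutils and checksum files named
# <archive>.sha256 / <archive>.sha512 (hypothetical layout).

verify_tree() {
    local dir=$1 sumfile tool i failures=0
    local -a pids=() names=()

    while IFS= read -r -d '' sumfile; do
        case $sumfile in
            *.sha256) tool=sha256sum ;;
            *.sha512) tool=sha512sum ;;
        esac
        # Run each check in the background, from the checksum file's own
        # directory so the relative file names inside it resolve.
        ( cd "$(dirname "$sumfile")" &&
          "$tool" --quiet -c "$(basename "$sumfile")" ) &
        pids+=("$!")
        names+=("$sumfile")
    done < <(find "$dir" -type f \( -name '*.sha256' -o -name '*.sha512' \) -print0)

    # Wait for every job individually and record which ones failed.
    for i in "${!pids[@]}"; do
        if ! wait "${pids[$i]}"; then
            echo "FAILED: ${names[$i]}" >&2
            failures=$((failures + 1))
        fi
    done
    return "$(( failures > 0 ))"
}
```

Called as `verify_tree /path/to/backups`, it returns nonzero if any check failed. Like the script described, it puts no limit on the number of simultaneous jobs.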

    Starting about a week ago, the script finds an error in one or more
    files out of several. Results are inconsistent: one pass may find an
    error in a given file, and the next pass may not find any errors in it.
    Running checksums manually, one at a time, does not turn up an error.
    Running "tar tvf" finds no error in a suspect file. Running "bunzip2 -t"
    also turns up no error. Only running the script turns up any errors.

    I create two checksum files when I create the backups, for sha256 and
    sha512. After this problem surfaced (about a week ago), I then made two
    new checksum files of a suspect file. The two checksum file pairs
    (e.g. both sha512sum files) show the same checksums. The script now
    tests using both the old and new checksum files. Sometimes only one pair
    of checksum files fails the suspect file.

    In addition to all of that, I also get the occasional "bad message"
    error. I have no idea what that means, but an fsck seems to deal with
    it.

    To be thorough, I have run extended SMART tests on the hard drives,
    kicked mdadm into testing the RAID array, and fscked the LVM partitions
    on the RAID array. Only fsck turned up issues, and that has not stopped.
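For reference, the three checks mentioned above usually correspond to commands like the ones in this sketch. The device names (`md0`, `/dev/sda` and so on, `/dev/vg0/backup`) and the helper names `run` and `check_storage` are hypothetical. With `DRY_RUN=1`, the default here, the commands are only printed, not executed.

```bash
# Sketch only; device names are placeholders, not the poster's.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

check_storage() {
    md=$1 lv=$2; shift 2
    for disk in "$@"; do
        run smartctl -t long "$disk"       # start an extended self-test
    done
    # Only meaningful once the self-tests have finished (hours later):
    for disk in "$@"; do
        run smartctl -l selftest "$disk"   # read the self-test log
    done
    # Ask md to read and compare all stripes, then read the mismatch count
    # (again, only meaningful after the check has completed).
    run sh -c "echo check > /sys/block/$md/md/sync_action"
    run cat "/sys/block/$md/md/mismatch_cnt"
    # Forced, read-only ext4 check; the volume should be unmounted.
    run fsck.ext4 -f -n "$lv"
}
```

For example, `check_storage md0 /dev/vg0/backup /dev/sda /dev/sdb /dev/sdc /dev/sdd` prints the full command list for a four-disk array.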

    I also back some of this up to offsite USB drives. I ran the script on
    one of those, using a different computer. No errors reported.

    I have a hypothesis as to what is going on, but would like to hear from
    you before I discuss it.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Andrew Latham@3:633/10 to All on Friday, May 22, 2026 18:10:02
    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS. The issues would not show up on a restart but after some
    usage it would hit the RAM errors and :(

    This may not be your issue, but I remember how annoying it was to
    figure out.

    --
    - Andrew "lathama" Latham -

  • From The Wanderer@3:633/10 to All on Friday, May 22, 2026 18:20:01
    On 2026-05-22 at 11:53, Charles Curley wrote:
    Starting about a week ago, the script finds an error in one or more
    files out of several. Results are inconsistent: one pass may find an
    error in a given file, the next pass not find any errors in it.
    Running checksums manually, one at a time, does not turn up an error.
    Running "tar tvf" finds no error in a suspect file. Running "bunzip2
    -t" also turns up no error. Only running the script turns up any
    errors.

    I create two checksum files when I create the backups, for sha256
    and sha512. After this problem surfaced (about a week ago), I then
    made two new checksum files of a suspect file. The two checksum file
    pairs (e.g. both sha512sum files) show the same checksums. The script
    now tests using both the old and new checksum files. Sometime only
    one pair of checksum files fail the suspect file.

    In addition to all of that, I also get the occasional "bad message"
    error. I have no idea what that means, but an fsck seems to deal
    with it.

    Just for clarity: where (from what source), and when (at what point), is
    it that you get this error?

    To be thorough, I have run extended SMART tests on the hard drives,
    kicked mdadm into testing the RAID array, and fscked the LVM
    partitions on the RAID array. Only fsck turned up issues, and that
    has not stopped.
    I have a hypothesis as to what is going on, but would like to hear
    from you before I discuss it.

    The very first thing that came to my mind out of that was RAM issues.
    Disk issues was the second, but the tests you've run there seem as if
    they'd probably have ruled that out.

    If you run a script to generate the hash of a given file in a loop
    (possibly with a don't-overload-the-system pause in between if you
    prefer), does it always show the same hash, or does it sometimes show
    a different one?

    --
    The Wanderer
    The reasonable man adapts himself to the world; the unreasonable one
    persists in trying to adapt the world to himself. Therefore all
    progress depends on the unreasonable man. -- George Bernard Shaw


  • From Charles Curley@3:633/10 to All on Friday, May 22, 2026 18:20:01
    On Fri, 22 May 2026 10:05:56 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS. The issues would not show up on a restart but after some
    usage it would hit the RAM errors and :(

    This is not impossible. I recently had some RAM go bad, failing
    memtest. I have replaced it with new RAM, which does not fail
    memtest. Maybe I should let it run for several passes.

    Thanks.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From Andrew Latham@3:633/10 to All on Friday, May 22, 2026 18:30:01
    Yes, I should have added that this RAM was only failing when warm/hot
    which was not fun to discover.

    --
    - Andrew "lathama" Latham -

  • From Charles Curley@3:633/10 to All on Friday, May 22, 2026 19:00:01
    On Fri, 22 May 2026 10:21:28 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    Yes, I should have added that this RAM was only failing when warm/hot
    which was not fun to discover.

    Hmmm, I wonder if memtest would stress the RAM enough to get it hot. Interesting.

    I do have a handheld infrared thermometer, which I mostly use for
    cooking. But it would be perfect for the occasional RAM stress test.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From Charles Curley@3:633/10 to All on Friday, May 22, 2026 19:10:01
    On Fri, 22 May 2026 12:19:08 -0400
    The Wanderer <wanderer@fastmail.fm> wrote:

    The very first thing that came to my mind out of that was RAM issues.
    Disk issues was the second, but the tests you've run there seem as if
    they'd probably have ruled that out.

    I agree that the tests I've run so far would tend to rule out disk
    issues. I just started another set of extended self tests, so we'll see
    where that goes. The longest should take about ten hours.


    If you run a script to generate the hash of a given file in a loop
    (possibly with a don't-overload-the-system pause in between if you
    prefer), does it always show the same hash, or does it sometimes show
    a different one?

    I haven't done such a script. But a freshly generated hash of a suspect
    file will agree with a hash created when the suspect file was created,
    which may have been five or seven years ago. At least so far.
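The loop suggested above could look something like this sketch. The function name `hash_loop`, the default pass count, and the pause are placeholders. It hashes one file repeatedly and prints each distinct digest with the number of times it appeared, so a consistent result is a single output line.

```bash
# Sketch only: hash FILE a number of times and count distinct digests.
# usage: hash_loop FILE [PASSES] [PAUSE_SECONDS]
hash_loop() {
    file=$1 passes=${2:-20} pause=${3:-0}
    i=0
    while [ "$i" -lt "$passes" ]; do
        sha256sum "$file" | cut -d' ' -f1   # digest only, no file name
        sleep "$pause"
        i=$((i + 1))
    done | sort | uniq -c                   # "<count> <digest>" per distinct digest
}
```

More than one output line for a file that has not changed would mean the same bytes hashed differently on different passes. Note that on Linux, repeated reads of a recently read file are normally served from the page cache, so later passes may not touch the disks at all unless the cache is dropped in between (as root, `echo 3 > /proc/sys/vm/drop_caches`).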


    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From nwe@3:633/10 to All on Friday, May 22, 2026 19:40:01
    On 5/22/26 11:05 AM, Andrew Latham wrote:

    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS.

    May I ask, was this ECC RAM?

    My personal NAS contains 16 sticks of 16GB DDR3 registered RAM. It is
    logging a CE memory scrubbing error once or twice a day, and has been
    for 700+ days.

    It is always the same page/address, triggering a soft offline of that
    memory page.

    At some point I mean to take the time to figure out which RAM stick is
    the culprit. I am aware of one unrecoverable RAM error, which I
    discovered one morning in the BIOS logs while investigating why this
    server unexpectedly collapsed and restarted one night about 2 years ago.

    If I knew there was harm in postponing the repair I might prioritize it.



  • From David Christensen@3:633/10 to All on Friday, May 22, 2026 20:10:01
    On 5/22/26 09:05, Andrew Latham wrote:
    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS. The issues would not show up on a restart but after some
    usage it would hit the RAM errors and :(


    On 5/22/26 09:14, Charles Curley wrote:
    This is not impossible. I recently had some RAM go bad, failing
    memtest. I have replaced it with new RAM, which does not fail
    memtest. Maybe I should let it run for several passes.


    When I suspect a memory problem, I run Memtest86+ for 24+ hours:

    https://memtest.org/

    Linux ISO (64 bits) -> mt86plus_8.10_x86_64.iso.zip

    The current version sets "CPU Sequencing Mode" to "Parallel (PAR)" by
    default.


    I use and suggest ECC memory.


    I use and suggest ZFS with redundant disks for storage.


    David

  • From Van Snyder@3:633/10 to All on Friday, May 22, 2026 20:40:01
    On Fri, 2026-05-22 at 09:53 -0600, Charles Curley wrote:
    I have four four terabyte hard drives. Each has a partition on it. The
    four partitions comprise a RAID 5 array using mdadm. On top of that,
    LUKS encryption, then LVM with ext4 logical volumes.

    I remarked to a local computer repair shop that I have a four TB backup
    drive. He said "replace it. Four TB isn't ready yet."
  • From nwe@3:633/10 to All on Friday, May 22, 2026 21:50:01
    On 5/22/26 1:32 PM, Van Snyder wrote:

    I remarked to a local computer repair shop that I have a four TB
    backup drive. He said "replace it. Four TB isn't ready yet."

    How so? I thought 4TB is showing its age...

    I'm running 12x 4TB drives. Used SAS drives. Accumulated power on time
    ranges from 40,166 to 73,439 hours.

    Smartctl informs me device /dev/sdf is worsening with increased read
    errors over time. That one shows 73408 hours powered up, 72360.67 GB
    read, 119545.193 GB written, 195 power cycles (13 since July 13 2024).
    Defect list increased from 3 to 6872.

    I see two other drives have defect lists of 23 and 14, respectively.
    All others are at 0. Considering that, I should probably prioritize
    replacing at least sdf soon to avoid losing redundancy during resilver,
    considering the age of the pool.


    nwe@srv01:~$ zpool status -c vendor,model,size
    pool: POOL1
    state: ONLINE
    scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10 04:34:06 2026
    config:

    NAME        STATE     READ WRITE CKSUM   vendor         model  size
    POOL1       ONLINE       0     0     0
      raidz3-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdc     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sdd     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sde     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdf     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdh     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdg     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdi     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdj     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdk     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdl     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdm     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T

    errors: No known data errors


  • From Charles Curley@3:633/10 to All on Friday, May 22, 2026 22:10:01
    On Fri, 22 May 2026 12:34:01 -0500
    nwe <nwe@gitcoding.net> wrote:

    May I ask, was this ECC RAM?

    Mine is not.

    Handle 0x0044, DMI type 17, 40 bytes
    Memory Device
    Array Handle: 0x0040
    Error Information Handle: Not Provided
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM_B2
    Bank Locator: BANK 3
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1600 MT/s
    Manufacturer: 0420
    Serial Number: 00000000
    Asset Tag: 9876543210
    Part Number: F3-1600C9-8GAB
    Rank: 2
    Configured Memory Speed: 1600 MT/s
    Minimum Voltage: 1.5 V
    Maximum Voltage: 1.5 V
    Configured Voltage: 1.5 V
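In a listing like the one above, "Total Width" equal to "Data Width" is what a non-ECC module reports. An ECC module carries extra check bits, so its total width is larger than its data width (commonly 72 bits total for 64 data bits). A small sketch that makes the comparison from `dmidecode -t memory` style text follows; the function name `ecc_width_report` is made up for this example.

```bash
# Sketch only: reads "dmidecode -t memory" text on stdin and prints one
# line per populated slot with its widths and whether check bits exist.
ecc_width_report() {
    awk -F': *' '
        /^[[:space:]]*Total Width:/ { total = $2 + 0 }   # "64 bits" -> 64
        /^[[:space:]]*Data Width:/  { data  = $2 + 0 }
        /^[[:space:]]*Locator:/     {                    # not "Bank Locator:"
            if (total > 0)                               # skip empty slots
                printf "%s total=%d data=%d %s\n", $2, total, data,
                       (total > data ? "ecc-bits" : "no-ecc-bits")
            total = data = 0
        }
    '
}
```

Typical use would be `sudo dmidecode -t memory | ecc_width_report`. This only shows what the modules report; whether the board and firmware actually have ECC enabled is a separate question.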



    My personal NAS contains 16 sticks of 16GB DDR3 registered RAM. It is logging a CE memory scrubbing error once or twice a day since 700+
    days.

    Where would one find such evidence? I imagine something like:

    journalctl | grep -i RAM

    but nothing that produces jumps out at me.
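One difficulty with `grep -i RAM` is that it also matches every line containing words such as "initramfs", "framebuffer", or "program". A narrower filter, sketched below, looks for the strings the kernel's EDAC and machine-check code commonly log. The keyword list is an assumption about what is useful, not an exhaustive one, and the function name `mem_error_lines` is made up.

```bash
# Sketch only: keep kernel-log lines (stdin) that look like memory or
# machine-check error reports.
mem_error_lines() {
    grep -iE 'edac|mce:|machine check|hardware error'
}
```

For example, `journalctl -k -b | mem_error_lines` covers the current boot. On a machine where the EDAC driver for the memory controller is loaded, the corrected-error counters can also be read directly with `grep . /sys/devices/system/edac/mc/mc*/ce_count`, and if rasdaemon is installed, `ras-mc-ctl --error-count` gives a per-DIMM summary. On a system without ECC there may simply be nothing to report, since the hardware has no means of detecting most such errors.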


    It is always the same page/address, triggering a soft offline of that
    memory page.


    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From nwe@3:633/10 to All on Friday, May 22, 2026 22:30:05
    On 5/22/26 3:02 PM, Charles Curley wrote:

    journalctl | grep -i RAM

    Sure enough, that gets me a boatload of RAM error reports on my server.
    On my desktop without ECC it does not. I think no noise = good. However,
    I have rasdaemon installed on the server, and I think it may take a
    combination of that + ECC to make the RAM errors log. It's been a while
    since I set this up. I think I had to change a setting in the Dell BIOS
    to prevent its log from eating the error instead of handing it to the OS.

    I was simply reading sudo dmesg.

    If I'm correct, memtest86 is nearly useless on ECC RAM.



  • From Charles Curley@3:633/10 to All on Friday, May 22, 2026 23:10:02
    On Fri, 22 May 2026 15:22:05 -0500
    nwe <nwe@gitcoding.net> wrote:

    On 5/22/26 3:02 PM, Charles Curley wrote:

    journalctl | grep -i RAM

    Sure enough, that gets me a boatload of RAM error reports on my
    server.

    But not one on my desktop. I also went back several boots, and still no
    errors.

    On my desktop without ECC it does not.

    Possibly because without ECC one has no way to detect the errors. I
    don't know enough about modern RAM to be sure, so that's a guess.

    I think no noise = good,

    Not always. Maybe if you have ECC.

    however, I have rasdaemon installed on the server, I think it may
    take a combination of that + ECC to make the RAM errors log.

    That is consistent with the description provided by "apt show
    rasdaemon".

    It's been a while since I set this up. I think I had
    to change a setting in the Dell bios to prevent its log from eating
    the error instead of handing it to the os.

    I was simply reading sudo dmesg.

    If I'm correct, memtest86 is nearly useless on ECC RAM.


    Maybe.

    "MemTest86 directly polls ECC errors logged in the chipset/memory
    controller registers and displays it to the user on-screen. In
    addition, ECC errors are written to the log and report file.

    "During testing, MemTest86 may report ECC errors detected by the memory controller if ECC is supported and enabled. This is demonstrated in the following screenshot:"

    https://www.memtest86.com/ecc.htm#memtest86

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From nwe@3:633/10 to All on Friday, May 22, 2026 23:30:01
    On 5/22/26 4:02 PM, Charles Curley wrote:

    If I'm correct, memtest86 is nearly useless on ECC RAM.

    Maybe.

    "MemTest86 directly polls ECC errors logged in the chipset/memory
    controller registers and displays it to the user on-screen. In
    addition, ECC errors are written to the log and report file.

    "During testing, MemTest86 may report ECC errors detected by the memory controller if ECC is supported and enabled. This is demonstrated in the following screenshot:"

    https://www.memtest86.com/ecc.htm#memtest86

    That memtest86 info is more useful than I remembered.

    It's been too long since I studied the subject, but more is coming back.
    The BIOS setting in my Dell R720XD rack server had something to do with
    a choice between having the hardware handle ECC versus allowing the OS
    to control it. The default setting was hardware, at which point the
    underlying ECC corrections/faults seemed hidden/inaccessible from the OS
    side of the memory controller and only appeared in the BIOS logs.

    I just now tried to find it, but suspect I will have to look at it the
    next time I'm in the server's BIOS settings, which is not often. It is a
    matter of connecting a monitor and keyboard directly to it and
    rebooting, and it takes a couple of minutes to POST.



  • From David Christensen@3:633/10 to All on Saturday, May 23, 2026 01:30:01
    On 5/22/26 12:43, nwe wrote:
    On 5/22/26 1:32 PM, Van Snyder wrote:

    I remarked to a local computer repair shop that I have a four TB
    backup drive. He said "replace it. Four TB isn't ready yet."

    How so? I though 4TB is showing its age...


    +1 I am also curious why 4 TB HDD's are not "ready yet".


    I'm running 12x 4TB drives. Used SAS drives. Accumulated power on time ranges from 40,166 to 73,439 hours.

    Smartctl informs me device /dev/sdf is worsening with increased read
    errors over time. That one shows 73408 hours powered up, 72360.67 GB
    read, 119545.193 GB written, 195 power cycles (13 since July 13 2024). Defect list increased from 3 to 6872.

    I see two other drives have defect lists of 23 and 14, respectively. All others are at 0. Considering that, I should probably prioritize
    replacing at least sdf soon to avoid losing redundancy during resilver, considering the age of the pool.


    I am still trying to understand the smartctl(8) "SMART Attributes Data
    Structure". The RAW_VALUE seems to be a binary bit field (?) for
    several attributes and is useless without manufacturer engineering data.
    The VALUE column is supposed to be a percentage that starts at 100%
    and goes down to 0% as the disk wears out:

    * Raw_Read_Error_Rate, Seek_Error_Rate, and Hardware_ECC_Recovered can
    have low VALUE numbers, but the disk seems to keep working.

    * Low VALUE numbers for Reallocated_Sector_Ct, Current_Pending_Sector,
    and/or Offline_Uncorrectable seem to be reliable indicators of a failing
    disk.

    * I have not seen a VALUE number other than 100% for End-to-End_Error.
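To watch only the three attributes singled out above as failure indicators, the ATA attribute table can be filtered as in this sketch. It assumes the usual ten-column `smartctl -A` layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the function name `smart_key_attrs` is made up.

```bash
# Sketch only: reads "smartctl -A" text on stdin and prints VALUE, THRESH,
# and RAW_VALUE for the reallocation-related attributes.
smart_key_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            printf "%-24s value=%s thresh=%s raw=%s\n", $2, $4, $6, $10
        }
    '
}
```

Typical use would be `sudo smartctl -A /dev/sdX | smart_key_attrs`. SAS drives, such as the ones in the pool quoted above, generally do not present this attribute table; smartctl shows error counter logs and a grown defect list for them instead.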


    nwe@srv01:~$ zpool status -c vendor,model,size
    pool: POOL1
    state: ONLINE
    scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10 04:34:06 2026
    config:

    NAME        STATE     READ WRITE CKSUM   vendor         model  size
    POOL1       ONLINE       0     0     0
      raidz3-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdc     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sdd     ONLINE       0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sde     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdf     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdh     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdg     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdi     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdj     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdk     ONLINE       0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdl     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T
        sdm     ONLINE       0     0     0       HP   MB4000FCWDK  3.6T

    errors: No known data errors


    Twelve disks gives you many choices for how to lay out the pool and
    trade off redundancy vs. capacity vs. performance. Is the data balanced
    across disks? Does the machine have enough memory? Is the ARC working
    well?


    On two of my earlier pools, I added a 60 GB SSD as a cache vdev after
    the pools had data. I did not notice any improvement.


    On one of my earlier pools of one mirror of two 3 TB HDD's that was
    nearly full, I added another mirror of two 3 TB HDD's. I did not notice
    any improvement.


    I rebuilt the storage pool with two mirrors of two 3 TB HDD's each and
    a special vdev mirror of two 180 GB SSD's. I also set
    special_small_blocks=16K. I then restored the data via replication.
    The data is now balanced across disks, latency has dropped, throughput
    has increased, and overall performance is noticeably better:

    2026-05-22 15:12:45 toor@f5 ~
    # freebsd-version
    13.5-RELEASE-p12

    2026-05-22 15:19:47 toor@f5 ~
    # zpool iostat -v p5
                        capacity     operations     bandwidth
    pool              alloc   free   read  write   read  write
    ----------------  -----  -----  -----  -----  -----  -----
    p5                3.76T  1.82T      6      1  3.68M  32.2K
      mirror-0        1.87T   871G      2      0  1.82M  4.48K
        gpt/hdd0.eli      -      -      1      0   931K  2.24K
        gpt/hdd1.eli      -      -      1      0   931K  2.24K
      mirror-1        1.86T   876G      2      0  1.81M  4.35K
        gpt/hdd2.eli      -      -      1      0   928K  2.18K
        gpt/hdd3.eli      -      -      1      0   928K  2.18K
    special               -      -      -      -      -      -
      mirror-2        31.1G   118G      1      1  51.2K  23.3K
        gpt/ssd0.eli      -      -      0      0  25.6K  11.7K
        gpt/ssd1.eli      -      -      0      0  25.6K  11.7K
    ----------------  -----  -----  -----  -----  -----  -----

    2026-05-22 15:32:42 toor@f5 ~
    # top -d 1 | head -n 7
    last pid: 57622; load averages: 0.24, 0.21, 0.17 up 24+22:47:05 15:32:45
    27 processes: 1 running, 26 sleeping
    CPU: 0.0% user, 0.0% nice, 0.6% system, 0.0% interrupt, 99.4% idle
    Mem: 4848K Active, 330M Inact, 856K Laundry, 14G Wired, 920M Buf, 694M Free
    ARC: 12G Total, 10G MFU, 485M MRU, 3328K Anon, 200M Header, 899M Other
         9921M Compressed, 33G Uncompressed, 3.36:1 Ratio
    Swap: 764M Total, 764M Free

    2026-05-22 15:33:12 toor@f5 ~
    # arc_summary | grep -A 5 "ARC total accesses"
    ARC total accesses (hits + misses): 512.7M
    Cache hit ratio: 99.8 % 511.8M
    Cache miss ratio: 0.2 % 886.5k
    Actual hit ratio (MFU + MRU hits): 99.3 % 509.2M
    Data demand efficiency: 99.5 % 4.8M
    Data prefetch efficiency: 19.2 % 96.9k
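For readers unfamiliar with special vdevs, the layout in the `zpool iostat -v p5` output above corresponds to commands along the lines of this sketch. These are not the poster's actual commands: the device names are taken from that output, the function name `print_pool_plan` is made up, and the function only prints the commands rather than running them.

```bash
# Sketch only: print (do not run) commands for a pool of two data mirrors
# plus a mirrored special vdev, with small blocks routed to the SSDs.
print_pool_plan() {
    pool=$1
    cat <<EOF
zpool create $pool mirror gpt/hdd0.eli gpt/hdd1.eli mirror gpt/hdd2.eli gpt/hdd3.eli special mirror gpt/ssd0.eli gpt/ssd1.eli
zfs set special_small_blocks=16K $pool
EOF
}
```

The `special_small_blocks` property only affects blocks written after it is set, which is consistent with restoring the data into the rebuilt pool rather than changing the setting in place.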


    In hindsight:

    1. I gathered file system statistics prior to rebuilding the pool and
    setting special_small_blocks=16K, but it now appears I could have used
    a larger value.

    2. If I get worried about HDD's failing, I can add disks to the pool as
    spares and/or add disks to the data mirrors. The latter should increase
    read performance even more.

    3. My ~10 year old HDD's can already saturate Gigabit with sequential
    I/O. RAID 10 with SSD acceleration is even more overkill. I want 10 GbE.
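The file system statistics mentioned in item 1 could be gathered with something like the sketch below, which buckets file sizes by power of two and prints a cumulative percentage. It is an assumption about what such statistics might look like, not the poster's method, and file size is only an approximation of the block sizes ZFS actually writes. The function name `size_histogram` is made up.

```bash
# Sketch only: reads one file size in bytes per line on stdin; prints, for
# each non-empty power-of-two bucket, the file count and the cumulative
# percentage of files at or below that size.
size_histogram() {
    awk '
        {
            b = 512
            while (b < $1) b *= 2     # smallest power of two >= size, min 512
            count[b]++; n++
        }
        END {
            for (b = 512; n > 0 && seen < n; b *= 2) {
                seen += count[b]
                if (count[b] > 0)
                    printf "<=%.0f %d %.1f%%\n", b, count[b], 100 * seen / n
            }
        }
    '
}
```

With GNU find the input can come from `find /pool/dataset -type f -printf '%s\n' | size_histogram`; on FreeBSD, `find /pool/dataset -type f -exec stat -f %z {} + | size_histogram` should give the same list. The cumulative column shows roughly what share of files would fit under a given `special_small_blocks` threshold.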


    David

  • From Van Snyder@3:633/10 to All on Saturday, May 23, 2026 01:40:01
    On Fri, 2026-05-22 at 16:25 -0700, David Christensen wrote:
    On 5/22/26 12:43, nwe wrote:
    On 5/22/26 1:32 PM, Van Snyder wrote:

    I remarked to a local computer repair shop that I have a four TB
    backup drive. He said "replace it. Four TB isn't ready yet."

    How so? I though 4TB is showing its age...


    +1  I am also curious why 4 TB HDD's are not "ready yet".

    The repair guy who made that remark to me didn't explain why he
    believed it. Maybe he's just not keeping up with what's happening. My
    WDC WD40EDAZ-11CFPB0 reports 68 power-on resets, zero read errors, and
    zero seek errors, but with only 2,791 power-on hours.

    I'm running 12x 4TB drives. Used SAS drives. Accumulated power on time
    ranges from 40,166 to 73,439 hours.

    Smartctl informs me device /dev/sdf is worsening with increased read
    errors over time. That one shows 73408 hours powered up, 72360.67 GB
    read, 119545.193 GB written, 195 power cycles (13 since July 13 2024).
    Defect list increased from 3 to 6872.

    I see two other drives have defect lists of 23 and 14, respectively. All
    others are at 0. Considering that, I should probably prioritize
    replacing at least sdf soon to avoid losing redundancy during resilver,
    considering the age of the pool.


    I am still trying to understand the smartctl(8) "SMART Attributes Data
    Structure". The RAW_VALUE seems to be a binary bit field (?) for
    several attributes and is useless without manufacturer engineering data.
    The VALUE column is supposed to be a percentage that starts at 100%
    and goes down to 0% as the disk wears out:

    * Raw_Read_Error_Rate, Seek_Error_Rate, and Hardware_ECC_Recovered can
    have low VALUE numbers, but the disk seems to keep working.

    * Low VALUE numbers for Reallocated_Sector_Ct, Current_Pending_Sector,
    and/or Offline_Uncorrectable seem to be reliable indicators of a
    failing disk.

    * I have not seen a VALUE number other than 100% for End-to-End_Error.
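
    The rule of thumb in the list above can be sketched in Python. This is a
    minimal sketch, assuming the ATA attribute table that "smartctl -A"
    prints (SAS drives report a different layout). The sample rows are
    invented for illustration and are not from any drive in this thread.
    Flagging a non-zero raw count is an added assumption on top of the
    VALUE/THRESH comparison, and the function names are hypothetical:

```python
# Attributes the list above singles out as reliable failure indicators.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Invented sample in the layout printed by "smartctl -A" for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   075   063   044    Pre-fail  Always       -       36184160
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def parse_attributes(text):
    """Parse attribute rows; skip the header and anything malformed."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw_count = int(fields[9])
        except ValueError:
            raw_count = None  # vendor-specific raw format, not a plain count
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw_count": raw_count,
        })
    return rows


def suspect_attributes(rows):
    """Return watched attributes at or below threshold, or with raw > 0."""
    suspects = []
    for row in rows:
        if row["name"] not in WATCHED:
            continue
        at_threshold = row["value"] <= row["thresh"]
        nonzero_raw = row["raw_count"] is not None and row["raw_count"] > 0
        if at_threshold or nonzero_raw:
            suspects.append(row)
    return suspects


for row in suspect_attributes(parse_attributes(SAMPLE)):
    print(row["name"], row["value"], row["thresh"], row["raw_count"])
```

    Raw_Read_Error_Rate is deliberately left out of the watch list, which
    matches the observation that its VALUE can be low on a working disk.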


    nwe@srv01:~$ zpool status -c vendor,model,size
    pool: POOL1
    state: ONLINE
    scan: scrub repaired 0B in 04:10:05 with 0 errors on Sun May 10 04:34:06 2026
    config:

    NAME          STATE   READ WRITE CKSUM   vendor         model  size
    POOL1         ONLINE     0     0     0
      raidz3-0    ONLINE     0     0     0
        sdb       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdc       ONLINE     0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sdd       ONLINE     0     0     0  TOSHIBA   MG04SCA40EN  3.6T
        sde       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdf       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdh       ONLINE     0     0     0       HP   MB4000FCWDK  3.6T
        sdg       ONLINE     0     0     0       HP   MB4000FCWDK  3.6T
        sdi       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdj       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdk       ONLINE     0     0     0  SEAGATE  ST4000NM0023  3.6T
        sdl       ONLINE     0     0     0       HP   MB4000FCWDK  3.6T
        sdm       ONLINE     0     0     0       HP   MB4000FCWDK  3.6T

    errors: No known data errors


    Twelve disks gives you many choices for how to layout the pool and
    trade-off redundancy vs. capacity vs. performance. Is the data balanced
    across disks? Does the machine have enough memory? Is the ARC working
    well?


    On two of my earlier pools, I added a 60 GB SSD as a cache vdev after
    the pools had data. I did not notice any improvement.


    On one of my earlier pools of one mirror of two 3 TB HDD's that was
    nearly full, I added another mirror of two 3 TB HDD's. I did not notice
    any improvement.


    I rebuilt the storage pool with two mirrors of two 3 TB HDD's each and a
    special vdev mirror of two 180 GB SSD's. I also set
    special_small_blocks=16K. I then restored the data via replication.
    The data is now balanced across disks, latency has dropped, throughput
    has increased, and overall performance is noticeably better:

    2026-05-22 15:12:45 toor@f5 ~
    # freebsd-version
    13.5-RELEASE-p12

    2026-05-22 15:19:47 toor@f5 ~
    # zpool iostat -v p5
                         capacity     operations     bandwidth
    pool              alloc   free   read  write   read  write
    ----------------  -----  -----  -----  -----  -----  -----
    p5                3.76T  1.82T      6      1  3.68M  32.2K
      mirror-0        1.87T   871G      2      0  1.82M  4.48K
        gpt/hdd0.eli      -      -      1      0   931K  2.24K
        gpt/hdd1.eli      -      -      1      0   931K  2.24K
      mirror-1        1.86T   876G      2      0  1.81M  4.35K
        gpt/hdd2.eli      -      -      1      0   928K  2.18K
        gpt/hdd3.eli      -      -      1      0   928K  2.18K
    special               -      -      -      -      -      -
      mirror-2        31.1G   118G      1      1  51.2K  23.3K
        gpt/ssd0.eli      -      -      0      0  25.6K  11.7K
        gpt/ssd1.eli      -      -      0      0  25.6K  11.7K
    ----------------  -----  -----  -----  -----  -----  -----

    2026-05-22 15:32:42 toor@f5 ~
    # top -d 1 | head -n 7
    last pid: 57622;  load averages:  0.24,  0.21,  0.17  up 24+22:47:05 15:32:45
    27 processes:  1 running, 26 sleeping
    CPU:  0.0% user,  0.0% nice,  0.6% system,  0.0% interrupt, 99.4% idle
    Mem: 4848K Active, 330M Inact, 856K Laundry, 14G Wired, 920M Buf, 694M Free
    ARC: 12G Total, 10G MFU, 485M MRU, 3328K Anon, 200M Header, 899M Other
         9921M Compressed, 33G Uncompressed, 3.36:1 Ratio
    Swap: 764M Total, 764M Free

    2026-05-22 15:33:12 toor@f5 ~
    # arc_summary | grep -A 5 "ARC total accesses"
    ARC total accesses (hits + misses):                  512.7M
            Cache hit ratio:                   99.8 %    511.8M
            Cache miss ratio:                   0.2 %    886.5k
            Actual hit ratio (MFU + MRU hits): 99.3 %    509.2M
            Data demand efficiency:            99.5 %      4.8M
            Data prefetch efficiency:          19.2 %     96.9k


    In hindsight:

    1. I gathered file system statistics prior to rebuilding the pool and
    setting special_small_blocks=16K, but it now appears I could have used a
    larger value.

    2. If I get worried about HDD's failing, I can add disks to the pool as
    spares and/or add disks to the data mirrors. The latter should increase
    read performance even more.

    3. My ~10 year old HDD's can already saturate Gigabit with sequential
    I/O. RAID 10 with SSD acceleration is even more overkill. I want 10 GbE.


    David



    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Saturday, May 23, 2026 02:00:01
    On 5/22/26 13:22, nwe wrote:
    On 5/22/26 3:02 PM, Charles Curley wrote:

    journalctl | grep -i RAM

    Sure enough, that gets me a boatload of RAM error reports on my server.
    On my desktop without ECC it does not. I think no noise = good, however,
    I have rasdaemon installed on the server, I think it may take a
    combination of that + ECC to make the RAM errors log. It's been a while since I set this up. I think I had to change a setting in the Dell bios
    to prevent its log from eating the error instead of handing it to the os.

    I was simply reading sudo dmesg.

    If I'm correct, memtest86 is nearly useless on ECC RAM.


    MemTest86 v11.7 Free Edition claims to support ECC:

    https://www.memtest86.com/compare.html


    AFAICT memtest86+ does not support ECC. Some people suggest disabling
    ECC in BIOS/UEFI Setup and then testing with memtest86+.


    David

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From nwe@3:633/10 to All on Saturday, May 23, 2026 02:20:01
    On 5/22/26 6:25 PM, David Christensen wrote:

    I am still trying to understand the smartctl(8) "SMART Attributes Data
    Structure". The RAW_VALUE seems to be a binary bit field (?)

    same here

    some makes/models of hardware seem to produce a greater quantity of comprehensible smart data

    Running, for example,
    # smartctl -x /dev/sdf
    returns additional data.


    Twelve disks gives you many choices for how to layout the pool and
    trade-off redundancy vs. capacity vs. performance. Is the data
    balanced across disks? Does the machine have enough memory? Is the
    ARC working well?

    It has 256GB RAM.

    My desktop pc currently has only 1Gb networking ever since I replaced my
    fiber optic card with a gpu in the lone PCIe slot. During the time I had
    10g networking direct from server to workstation, I recall easily
    saturating the network. Amazing speed, but I needed the gpu more. 10g networking is faster than the cheap SSDs in most of my PCs.



    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Saturday, May 23, 2026 07:50:01
    On 5/22/26 17:16, nwe wrote:
    On 5/22/26 6:25 PM, David Christensen wrote:

    I am still trying to understand the smartctl(8) "SMART Attributes Data
    Structure". The RAW_VALUE seems to be a binary bit field (?)

    same here

    some makes/models of hardware seem to produce a greater quantity of comprehensible smart data

    Running, for example,
    # smartctl -x /dev/sdf
    returns additional data.


    Twelve disks gives you many choices for how to layout the pool and
    trade-off redundancy vs. capacity vs. performance. Is the data
    balanced across disks? Does the machine have enough memory? Is the
    ARC working well?

    It has 256GB RAM.

    My desktop pc currently has only 1Gb networking ever since I replaced my fiber optic card with a gpu in the lone PCIe slot. During the time I had
    10g networking direct from server to workstation, I recall easily
    saturating the network. Amazing speed, but I needed the gpu more. 10g networking is faster than the cheap SSDs in most of my PCs.


    12 HDD raidz3 already exceeds 10 Gbps for sequential I/O.
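
    A rough back-of-the-envelope check of that claim, as a Python sketch.
    The per-disk sequential rate of 150 MB/s is an assumed figure, not a
    measurement from this pool, and the model is an idealized upper bound
    (it treats the nine non-parity disks of a 12-disk raidz3 as streaming in
    parallel and ignores protocol and ZFS overhead):

```python
def link_mb_per_s(gbps):
    """Raw line rate of a network link in MB/s (decimal units, no overhead)."""
    return gbps * 1000 / 8


def raidz_sequential_mb_per_s(total_disks, parity_disks, per_disk_mb_per_s):
    """Idealized sequential throughput: data disks times per-disk rate."""
    return (total_disks - parity_disks) * per_disk_mb_per_s


ASSUMED_PER_DISK_MB_PER_S = 150  # hypothetical figure for a 4 TB 7200 rpm HDD

pool = raidz_sequential_mb_per_s(12, 3, ASSUMED_PER_DISK_MB_PER_S)
for gbps in (1, 10):
    link = link_mb_per_s(gbps)
    verdict = "saturated" if pool >= link else "not saturated"
    print(f"{gbps} GbE: link {link:.0f} MB/s, pool {pool:.0f} MB/s, {verdict}")
```

    Under that assumption the pool's ideal rate is 1350 MB/s against a
    10 GbE line rate of 1250 MB/s, so the margin is small and depends
    heavily on the real per-disk rate.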


    256 GB of memory could be enough to fit your entire workload within the
    ARC, so random I/O may have also saturated 10 Gbps. If your processor
    memory bus has enough channels, your NIC has enough PCIe lanes, and your workload does not require synchronous writes (or you tune ZFS to fake
    it), a suitable workstation or backup server could saturate 25, 50, 100
    Gbps single/ dual/ quad Ethernet. SFPx switches are expensive, so I
    have considered dual SFPx cards in my workstation, primary server, and
    backup server; connected in a ring.


    If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
    and 10 GbE (copper RJ-45) Ethernet adapters. If you have a Thunderbolt
    3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber single and dual
    Ethernet adapters. Be sure to verify Debian and Linux driver support
    with the manufacturer before purchasing:

    https://www.amazon.com/s?k=thunderbolt+sfp+ethernet

    https://www.sonnettech.com/product/twin25g/overview.html


    I have been using Intel SSD 520 Series 2.5" SATA III drives for many
    years. They are an enterprise grade product that were put in various desktops, laptops, and netbooks ~14 years ago. Resellers harvest and
    resell them on eBay. The 60 GB model works well for Linux and FreeBSD
    system drives. The 180 GB model has the best performance
    specifications, and works well for Windows system drives and ZFS
    accelerators (at my 6 TB scale). Prices for used smaller drives can
    still be reasonable in spite of the AI bubble.


    All that said, when your workload goes outside the ARC or the HDD
    caches, HDD latency will become the bottleneck. SATA/SAS 6 or 12 Gbps
    SSD accelerators chosen to match the workload could help. NVMe PCIe
    would be even better. Optane would be best:

    https://goughlui.com/2024/07/28/tech-flashback-intel-optane-3d-xpoint-memory/


    How do you back up a 36 TB pool?


    David

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From nwe@3:633/10 to All on Saturday, May 23, 2026 17:00:01
    On 5/23/26 12:44 AM, David Christensen wrote:

    If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
    and 10 GbE (copper RJ-45) Ethernet adapters. If you have a
    Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
    single and dual Ethernet adapters. Be sure to verify Debian and Linux
    driver support with the manufacturer before purchasing:

    https://www.amazon.com/s?k=thunderbolt+sfp+ethernet

    https://www.sonnettech.com/product/twin25g/overview.html
    I've looked at those, been thinking of trying it some time. The only
    time I really wish for faster than 1g networking straight to my
    workstation is when I'm cloning a complete hard drive to a backup image
    file in the pool.

    I've been buying cheap 10G dual SFP PCIe cards on ebay. I configure the
    two ports as a bridge then just run fiber from one machine to the next.
    The server is in the middle of the string, so if I shut that down, my
    whole network is as good as down. Nearly all my network services depend
    on the server anyway.

    The only glitch I've run into so far is I've got to match the correct
    optics to the cards, speaking of vendor-lock-in. Intel cards want intel optics. Most other brands seem to accept most other brands optics. No experience yet with Cisco brand. Dell can come as intel or other.

    Cheap managed network switches from aliexpress two 10g SFP+ ports plus
    four 1g RJ45, I guess I got what I paid for. They seem to mostly work, I
    had a few random failures along the way. I've read scare stories about
    these potentially dialing home etc. I have not confirmed such. I suspect
    these can't as I have them configured on an isolated vlan.

    One wonderful network switch deal I find common on ebay: OS6450-P48 it
    is cheap, 48 poe 1gig ports plus two 10g SFP ports that don't seem real
    picky what brand optics I use. It is a bit technical to configure.

    I have been using Intel SSD 520 Series 2.5" SATA III drives for many
    years.
    Same here, just discovered them cheap on ebay a few years ago. Some show
    up with SMART reports indicating wear that would trigger me replacing a consumer grade ssd. I've been using those nearly daily over 2 years with
    no failures yet.
    How do you back up a 36 TB pool?

    (blush)

    I know my current backup scheme could someday bite:

    1. I have a full rsync backup on a spare R710 server in another
    building. That server probably hasn't been powered up in 18 months. All
    I need to do is go plug it into an outlet then spend a couple minutes at
    a ssh session from the comfort of my office chair.

    2. The most critical files get regularly backed up (manually) to a
    remote storage vps.
    I compress the files into an encrypted archive then upload using scp.

    3. Some random memory sticks and hard drives...

    So I've got some coverage in case of disaster but it could use improvement.
    I know automatic backups could be nice but I just don't trust such.

    I also have a clone of the server's boot drive.

    nwe



    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Sunday, May 24, 2026 04:30:01
    On 5/23/26 07:52, nwe wrote:
    On 5/23/26 12:44 AM, David Christensen wrote:
    If you have a USB 3.x A or C port, various manufacturers make 2.5, 5,
    and 10 GbE (copper RJ-45) Ethernet adapters. If you have a
    Thunderbolt 3, 4, or 5 port, a few make 10 and 25 Gbps SFPx fiber
    single and dual Ethernet adapters.

    I've looked at those, been thinking of trying it some time. The only
    time I really wish for faster than 1g networking straight to my
    workstation is when I'm cloning a complete hard drive to a backup image
    file in the pool.


    +1 for images, especially when they are encrypted and uncompressible. I
    image my system drives monthly and keep the newest three. The Linux and FreeBSD system images are deliberately small enough to fit onto "16 GB" devices (SD card, USB, SSD, or HDD). This is plenty for the servers and maintenance/rescue live drives, but I sometimes want more for the
    workstation. The worst are Windows machines with one drive and
    everything on it.
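
    The "keep the newest three" rotation can be sketched as a small pure
    function. This is only an illustration in Python, not the procedure
    actually used here; the image names and timestamps below are made up,
    and nothing is deleted, the function only reports which images would be
    pruned:

```python
def images_to_prune(images, keep=3):
    """Given (name, timestamp) pairs, return names older than the newest `keep`."""
    newest_first = sorted(images, key=lambda item: item[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]


# Hypothetical monthly images; timestamps are YYYYMMDD integers.
example_images = [
    ("sys-20260201.img", 20260201),
    ("sys-20260301.img", 20260301),
    ("sys-20260401.img", 20260401),
    ("sys-20260501.img", 20260501),
    ("sys-20260101.img", 20260101),
]

for name in images_to_prune(example_images):
    print("would remove:", name)
```

    Sorting by timestamp rather than by file name keeps the function
    independent of whatever naming scheme the images use.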


    Another use-case for enterprise-speed Ethernet is ZFS replication.


    The only glitch I've run into so far is I've got to match the correct
    optics to the cards, speaking of vendor-lock-in. Intel cards want intel optics. Most other brands seem to accept most other brands optics. No experience yet with Cisco brand. Dell can come as intel or other.


    Thank you for the warning.


    Cheap managed network switches from aliexpress two 10g SFP+ ports plus
    four 1g RJ45, I guess I got what I paid for. They seem to mostly work, I
    had a few random failures along the way. I've read scare stories about
    these potentially dialing home etc. I have not confirmed such. I suspect these can't as I have them configured on an isolated vlan.

    One wonderful network switch deal I find common on ebay: OS6450-P48 it
    is cheap, 48 poe 1gig ports plus two 10g SFP ports that don't seem real picky what brand optics I use. It is a bit technical to configure.


    I switched to Ubiquiti Networks UniFi products several years ago when I
    got tired of logging in to multiple different web control panels and
    trying to keep all the network settings in sync. "One web control panel
    to rule them all" is a killer feature that is worth paying for. So, I
    look at their 2.5/5/10 Gbps switches periodically; but have not made a purchase (yet).


    I have been using Intel SSD 520 Series 2.5" SATA III drives for many
    years.

    Same here, just discovered them cheap on ebay a few years ago. Some show
    up with SMART reports indicating wear that would trigger me replacing a consumer grade ssd. I've been using those nearly daily over 2 years with
    no failures yet.


    I bought my first Intel SSD 520 Series 60 GB 2.5" SATA III at Best Buy
    on Black Friday the year they came out. I now have eight 60 GB drives
    plus eight 180 GB drives. Many were bought used. I agree that some of
    their SMART statistics can look scary, but the lowest
    Media_Wearout_Indicator VALUE for any of them is 95. Using 180 GB
    drives as a ZFS special vdev with special_small_blocks and 24x7 in a
    lightly used SOHO server is slowly eating those drives.


    How do you back up a 36 TB pool?

    (blush)

    I know my current backup scheme could someday bite:

    1. I have a full rsync backup on a spare R710 server in another
    building. That server probably hasn't been powered up in 18 months. All
    I need to do is go plug it into an outlet then spend a couple minutes at
    a ssh session from the comfort of my office chair.

    2. The most critical files get regularly backed up (manually) to a
    remote storage vps.
    I compress the files into an encrypted archive then upload using scp.

    3. Some random memory sticks and hard drives...

    So I've got some coverage in case of disaster but it could use improvement.
    I know automatic backups could be nice but I just don't trust such.

    I also have a clone of the server's boot drive.

    nwe

    I came to the conclusion that I needed a matching server for backups.
    So, I built one and replicate weekly. It sounds like you could dust off
    the R710. An enterprise-grade Ethernet connection between the primary
    and backup servers would encourage regular backups.


    I also replicate to a near-site external disk weekly, and rotate that
    with an off-site disk bi-monthly. Seagate says they make a 36 TB HDD,
    but I cannot find any for sale or in stock on the WWW. An external RAID enclosure could reach that capacity with smaller drives, but external
    drive enclosures have failed me too many times over the years.


    David

    --- PyGate Linux v1.5.14
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Anssi Saari@3:633/10 to Unknown on Monday, May 25, 2026 11:50:02
    nwe <nwe@gitcoding.net> writes:

    If I'm correct, memtest86 is nearly useless on ECC RAM.

    There's a commercial variant that has an option to enable error
    injection mode but on the two computers I've tried it, the manufacturer
    has conveniently disabled that.

    The only remaining way for consumers to see ECC in action is apparently
    to undervolt the RAM to force errors. Assuming such a setting is
    available. And seems like lots of work for minimal gain.

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Andy Smith@3:633/10 to Unknown on Monday, May 25, 2026 13:20:01
    Hi,

    On Fri, May 22, 2026 at 04:58:49PM -0700, David Christensen wrote:
    On 5/22/26 13:22, nwe wrote:
    On 5/22/26 3:02 PM, Charles Curley wrote:
    If I'm correct, memtest86 is nearly useless on ECC RAM.

    MemTest86 v11.7 Free Edition claims to support ECC:

    https://www.memtest86.com/compare.html

    AFAICT memtest86+ does not support ECC. Some people suggest disabling ECC
    in BIOS/UEFI Setup and then testing with memtest86+.

    The only machines I have ECC RAM in also have Machine Check Exception
    messages go to a log in the firmware. I have experienced running
    memtest86 for a few successful complete passes and then finding messages
    in the firmware log about things that were corrected, which enabled me
    to locate the bad stick.

    Also, most of the time I've had RAM fail it's done so in a way that ECC
    can't fix, because ECC can only correct a single bit flip.

    So, I have continued to find memtest86 useful although it would be a bit
    less so if I had no way to see the MCE log.

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Monday, May 25, 2026 23:20:01
    On 5/25/26 04:16, Andy Smith wrote:
    Hi,


    Hello. :-)


    The only machines I have ECC RAM in also have Machine Check Exception messages go to a log in the firmware. I have experienced running
    memtest86 for a few successful complete passes and then finding messages
    in the firmware log about things that were corrected, which enabled me
    to locate the bad stick.

    Also, most of the time I've had RAM fail it's done so in a way that ECC
    can't fix, because ECC can only correct a single bit flip.

    So, I have continued to find memtest86 useful although it would be a bit
    less so if I had no way to see the MCE log.


    It sounds like I need to learn about Debian's "collectd-core" module and mcelog(8) (?).


    David

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charles Curley@3:633/10 to All on Monday, June 01, 2026 23:00:02
    On Fri, 22 May 2026 10:05:56 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS. The issues would not show up on a restart but after some
    usage it would hit the RAM errors and :(

    This may not be your issue, but I remember how annoying it was to
    figure out.

    Thanks. I tried testing for this. I recently had one of two RAM sticks
    go bad. I bought replacements in April and installed them. To test for
    one of the new sticks being bad, I put the one good stick in instead of
    the two new ones. The problem shows up. So I'm not sure whether I have
    bad RAM or not.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Andrew Latham@3:633/10 to All on Monday, June 01, 2026 23:10:01
    Charles

    There are many things that might be wrong, just reading your OP I
    thought about how it matched up with my bad ram situation.

    Beyond chasing hardware, is there any situation where the checksums or
    tar are being run against files that are being written to? Should not
    hurt to ask

    Also, are there any time stamps you can match up to system logs and or
    dmesg?
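
    One way to tell "the file changed while it was being read" apart from
    "the read path returned different bytes" is to hash the same file
    several times in a row and compare the digests along with the file's
    size and mtime. A minimal Python sketch, assuming nothing about the
    original script (the function names are made up for illustration). Note
    that repeated reads of a file that fits in RAM are likely served from
    the page cache, so without dropping caches between passes this mostly
    exercises memory rather than the disks:

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_signature(path):
    """Size and modification time, used to detect concurrent writes."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


def check_stability(path, passes=3):
    """Hash a file `passes` times and report whether anything differed."""
    before = file_signature(path)
    digests = [sha256_of(path) for _ in range(passes)]
    after = file_signature(path)
    return {
        "digests": digests,
        "digests_agree": len(set(digests)) == 1,
        "file_unchanged": before == after,
    }
```

    If "file_unchanged" is true but "digests_agree" is false, the data on
    disk did not change between passes and the mismatch arose somewhere
    between the platters and the hashing process.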

    On Mon, Jun 1, 2026 at 2:57 PM Charles Curley
    <charlescurley@charlescurley.com> wrote:

    On Fri, 22 May 2026 10:05:56 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    I had an issue some months back. It turned out to be a bad RAM stick
    in my NAS. The issues would not show up on a restart but after some
    usage it would hit the RAM errors and :(

    This may not be your issue, but I remember how annoying it was to
    figure out.

    Thanks. I tried testing for this. I recently had one of two RAM sticks
    go bad. I bought replacements in April and installed them. To test for
    one of the new sticks being bad, I put the one good stick in instead of
    the two new ones. The problem shows up. So I'm not sure whether I have
    bad RAM or not.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/



    --
    - Andrew "lathama" Latham -

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charles Curley@3:633/10 to All on Monday, June 01, 2026 23:20:01
    On Fri, 22 May 2026 09:53:17 -0600
    Charles Curley <charlescurley@charlescurley.com> wrote:

    To be thorough, I have run extended SMART tests on the hard drives,
    kicked mdadm into testing the RAID array, and fscked the LVM
    partitions on the RAID array. Only fsck turned up issues, and that
    has not stopped.

    Some additional testing.

    Suspecting a bad hard drive, I ran more extended tests on all four
    members of the RAID array. One showed problems:

    "Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
    " When the command that caused the error occurred, the device was active or idle.",
    "",
    " After command completion occurred, registers were:",
    " ER -- ST COUNT LBA_48 LH LM LL DV DC",
    " -- -- -- == -- == == == -- -- -- -- --",
    " 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
    "",
    " Commands leading to the command that caused the error were:",
    " CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
    " -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
    " 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
    " b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
    " b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
    "",
    "SMART Extended Self-test Log Version: 1 (1 sectors)",
    "Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
    "# 1 Extended offline Completed without error 00% 6756 -",
    "# 2 Extended offline Completed without error 00% 6573 -",
    "# 3 Extended offline Completed without error 00% 102 -",
    "# 4 Short offline Completed without error 00% 96 -",
    "",


    So I did the obvious: I failed and removed the drive from the array. The
    problem still showed up, but not as many fails in the same data set.

    I have since added the drive back to the array, and am testing the
    array now.

    mdadm --monitor --test --oneshot /dev/md0
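
    As far as mdadm(8) describes it, "--monitor --test --oneshot" sends a
    TestMessage alert for each array and exits; it does not read the array
    data. A consistency check is normally requested through sysfs (as root,
    writing "check" to /sys/block/md0/md/sync_action), and the result shows
    up in mismatch_cnt once the check has finished. A minimal Python sketch
    for reading those two files afterwards; the function names and the
    sysfs_root parameter are illustrative, and md0 is taken from the command
    above:

```python
from pathlib import Path


def read_md_status(array="md0", sysfs_root="/sys/block"):
    """Read the current sync action and mismatch count of an md array."""
    md_dir = Path(sysfs_root) / array / "md"
    sync_action = (md_dir / "sync_action").read_text().strip()
    mismatch_cnt = int((md_dir / "mismatch_cnt").read_text().strip())
    return {"sync_action": sync_action, "mismatch_cnt": mismatch_cnt}


def needs_attention(status):
    """True when the array is idle and the last check counted mismatches."""
    return status["sync_action"] == "idle" and status["mismatch_cnt"] > 0


if __name__ == "__main__":
    try:
        status = read_md_status()
    except (OSError, ValueError) as error:
        print("could not read md status:", error)
    else:
        print(status, "needs attention:", needs_attention(status))
```

    The sysfs_root parameter exists only so the function can be pointed at
    a copy of the directory tree for testing.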

    I begin to wonder if I have a bad motherboard.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Andrew Latham@3:633/10 to All on Monday, June 01, 2026 23:30:01
    I hate to suggest other tangents but re-seat all the connectors and
    maybe a power supply test. A brown-out of power would cause issues
    like this.

    On Mon, Jun 1, 2026 at 3:15 PM Charles Curley
    <charlescurley@charlescurley.com> wrote:

    On Fri, 22 May 2026 09:53:17 -0600
    Charles Curley <charlescurley@charlescurley.com> wrote:

    To be thorough, I have run extended SMART tests on the hard drives,
    kicked mdadm into testing the RAID array, and fscked the LVM
    partitions on the RAID array. Only fsck turned up issues, and that
    has not stopped.

    Some additional testing.

    Suspecting a bad hard drive, I ran more extended tests on all four
    members of the RAID array. One showed problems:

    "Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
    " When the command that caused the error occurred, the device was active or idle.",
    "",
    " After command completion occurred, registers were:",
    " ER -- ST COUNT LBA_48 LH LM LL DV DC",
    " -- -- -- == -- == == == -- -- -- -- --",
    " 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
    "",
    " Commands leading to the command that caused the error were:",
    " CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
    " -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
    " 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
    " b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
    " b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
    "",
    "SMART Extended Self-test Log Version: 1 (1 sectors)",
    "Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
    "# 1 Extended offline Completed without error 00% 6756 -",
    "# 2 Extended offline Completed without error 00% 6573 -",
    "# 3 Extended offline Completed without error 00% 102 -",
    "# 4 Short offline Completed without error 00% 96 -",
    "",


    So I did the obvious: I failed and removed the drive from the array. The
    problem still showed up, but not as many fails in the same data set.

    I have since added the drive back to the array, and am testing the
    array now.

    mdadm --monitor --test --oneshot /dev/md0

    I begin to wonder if I have a bad motherboard.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/



    --
    - Andrew "lathama" Latham -

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charles Curley@3:633/10 to All on Tuesday, June 02, 2026 00:00:01
    On Mon, 1 Jun 2026 15:23:02 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    I hate to suggest other tangents but re-seat all the connectors and
    maybe a power supply test. A brown-out of power would cause issues
    like this.

    I did reseat all the data and power cables for the hard drives, the SSD
    system drive and the CD, etc. drive. I doubt I've had a brown-out, as
    the computer in question is on a UPS which has plenty of spare battery.

    As for the power supply test, not having a power supply tester, I have
    not done that.


    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 00:50:01
    On 6/1/26 14:15, Charles Curley wrote:
    Some additional testing.

    Suspecting a bad hard drive, I ran more extended tests on all four
    members of the RAID array. One showed problems:

    "Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
    " When the command that caused the error occurred, the device was active or idle.",
    "",
    " After command completion occurred, registers were:",
    " ER -- ST COUNT LBA_48 LH LM LL DV DC",
    " -- -- -- == -- == == == -- -- -- -- --",
    " 40 -- 51 00 01 00 00 00 00 00 00 40 00 Error: UNC 1 sectors at LBA = 0x00000000 = 0",
    "",
    " Commands leading to the command that caused the error were:",
    " CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name",
    " -- == -- == -- == == == -- -- -- -- -- --------------- --------------------",
    " 25 00 00 00 01 00 00 00 00 00 00 40 00 00:08:36.585 READ DMA EXT",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.545 IDENTIFY DEVICE",
    " b0 00 da 00 00 00 00 00 c2 4f 00 00 00 00:08:31.542 SMART RETURN STATUS",
    " b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00 00:08:31.541 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
    " ec 00 00 00 00 00 00 00 00 00 00 00 00 00:08:31.541 IDENTIFY DEVICE",
    "",
    "SMART Extended Self-test Log Version: 1 (1 sectors)",
    "Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error",
    "# 1 Extended offline Completed without error 00% 6756 -",
    "# 2 Extended offline Completed without error 00% 6573 -",
    "# 3 Extended offline Completed without error 00% 102 -",
    "# 4 Short offline Completed without error 00% 96 -",
    "",


    So I did the obvious: I failed and removed the drive from the array.
    The problem still showed up, but not as many fails in the same data
    set.

    I have since added the drive back to the array, and am testing the
    array now.

    mdadm --monitor --test --oneshot /dev/md0

    I begin to wonder if I have a bad motherboard.


    Up until 2019, I was using Debian GNU/Linux on desktop hardware as a
    file server. When I upgraded to a server motherboard and ECC memory, I
    started seeing DMA errors. During trouble-shooting, I realized that I
    had been collecting SATA parts since the days of SATA I 1.5 Gbps --
    HBA's, cables, racks, and drawers. My file server had a mix of various
    known and unknown parts, including red SATA cables (red dye can cause
    copper conductors to oxidize into dust). So, I replaced all of the
    unknown and obsolete parts with new parts clearly rated and marked for
    SATA III 6 Gbps. The disk problems went away. When I wanted more
    HDD's, I bought SAS 6 Gbps HBA's, cables, and HDD's.


    Similarly, most of the memory problems I encountered were caused by
    incompatibility between the motherboard and the memory module(s). I
    suggest documenting your motherboard, documenting your memory modules,
    and doing the homework. Memory manufacturers typically have a search
    feature on their web site that will produce a list of compatible memory
    modules given a computer or motherboard make and model. eBay sellers
    often include the computer/motherboard make/model for pulled memory
    modules. And, you can always STFW.


    For a server, I prefer and recommend workstation/server motherboards,
    ECC memory, ext4/UFS for the system disk, and ZFS RAID10 for data.


    David

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Andrew Latham@3:633/10 to All on Tuesday, June 02, 2026 01:20:01
    I was less than clear, I meant a brown-out on the DC rail. Checksums
    do cause a power spike.

    On Mon, Jun 1, 2026 at 3:56 PM Charles Curley
    <charlescurley@charlescurley.com> wrote:

    On Mon, 1 Jun 2026 15:23:02 -0600
    Andrew Latham <lathama@gmail.com> wrote:

    I hate to suggest other tangents but re-seat all the connectors and
    maybe a power supply test. A brown-out of power would cause issues
    like this.

    I did reseat all the data and power cables for the hard drives, the SSD
    system drive and the CD, etc. drive. I doubt I've had a brown-out, as
    the computer in question is on a UPS which has plenty of spare battery.

    As for the power supply test, not having a power supply tester, I have
    not done that.


    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/



    --
    - Andrew "lathama" Latham -

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 01:20:01
    On 6/1/26 14:55, Charles Curley wrote:
    I did reseat all the data and power cables for the hard drives, the SSD
    system drive and the CD, etc. drive.


    Good.


    I doubt I've had a brown-out, as
    the computer in question is on a UPS which has plenty of spare battery.


    Check the voltage at the UPS input with a good voltmeter. In the USA,
    voltage should be 120 VAC. National Electrical Code tolerance is +/- 5%.
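
    As a worked check of the figures above: 5% of 120 V is 6 V, so the
    acceptable window runs from 114 V to 126 V. A minimal sketch of that
    arithmetic follows. The function name within_tolerance and the sample
    reading are made up for illustration; substitute the nominal voltage
    and tolerance that apply locally.

```shell
#!/bin/sh
# within_tolerance NOMINAL PERCENT MEASURED
# Succeeds (exit 0) when MEASURED lies inside NOMINAL +/- PERCENT, else fails.
# Hypothetical helper for illustration; awk is used for the decimal arithmetic.
within_tolerance() {
    awk -v nom="$1" -v pct="$2" -v v="$3" 'BEGIN {
        margin = nom * pct / 100      # 120 * 5 / 100 = 6
        lo = nom - margin             # 114
        hi = nom + margin             # 126
        exit !(v >= lo && v <= hi)
    }'
}

# Example with an invented meter reading:
if within_tolerance 120 5 118.4; then
    echo "118.4 V is within 120 V +/- 5%"
fi
```

    A single reading from a handheld meter only shows the voltage at that
    instant; it says nothing about short sags between readings.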


    A good UPS should be able to accept dirty out-of-spec input and produce
    clean, tight output.


    Do not plug laser printers, photo copiers, or any other heavy loads into
    the computer UPS. Those machines draw significant current in surges and
    can clip the output of UPS's, causing the voltage to sag, lights to
    flicker, and electronics to malfunction.


    As for the power supply test, not having a power supply tester, I have
    not done that.


    Buy one. It will save your sanity.


    Related -- I bought good cases, power supplies, and fans when I built my
    most recent servers:

    https://www.fractal-design.com/products/cases/define/define-r5/black/

    https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/

    https://www.fractal-design.com/products/fans/dynamic/dynamic-gp-14/white/


    David

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From nwe@3:633/10 to All on Tuesday, June 02, 2026 06:30:01
    On 6/1/26 4:23 PM, Andrew Latham wrote:

    I hate to suggest other tangents but re-seat all the connectors and
    maybe a power supply test. A brown-out of power would cause issues
    like this.

    +1

    I've seen too many hard-to-diagnose problems that disappeared after
    replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
    video noise, hang on reboot, etc. I no longer try to go cheap when
    choosing a PSU for builds.

    For example, I currently recommend Corsair 750 Watt as a minimum,
    however, there are also other good quality brands. I concede there are
    lots of PCs out there running PSUs lesser than Corsair 750 Watt just
    fine. I do notice the PC on which I'm typing this is still running a
    cheap knock-off 350 Watt PSU that has not yet caused me any problems.



    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Stefan Monnier@3:633/10 to All on Tuesday, June 02, 2026 16:10:01
    I've seen too many hard-to-diagnose problems that disappeared after
    replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
    video noise, hang on reboot, etc. I no longer try to go cheap when
    choosing a PSU for builds.

    Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
    while it's now standard for chips to monitor their temperature (and
    adjust their power consumption if it gets too high), I still haven't
    seen anything comparable that would detect and report when the input
    voltage goes out-of-range (and maybe also take steps to reduce the
    instantaneous power consumption?).

    Instead, we're in the dark, forced to try to avoid the problem
    by over-provisioning the power supply and pray that it was enough.


    === Stefan

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Tuesday, June 02, 2026 21:10:01
    On 6/2/26 07:02, Stefan Monnier wrote:
    I've seen too many hard-to-diagnose problems that disappeared after
    replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
    video noise, hang on reboot, etc. I no longer try to go cheap when
    choosing a PSU for builds.

    Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
    while it's now standard for chips to monitor their temperature (and
    adjust their power consumption if it gets too high), I still haven't
    seen anything comparable that would detect and report when the input
    voltage goes out-of-range (and maybe also take steps to reduce the
    instantaneous power consumption?).

    Instead, we're in the dark, forced to try to avoid the problem
    by over-provisioning the power supply and pray that it was enough.


    === Stefan


    https://en.wikipedia.org/wiki/Lm_sensors

    lm_sensors (Linux-monitoring sensors) is a free open-source
    software-tool for Linux that provides tools and drivers for monitoring
    temperatures, voltage, humidity, and fans. It can also detect chassis
    intrusions.


    https://packages.debian.org/search?searchon=names&keywords=lm-sensors


    David

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Dan Ritter@3:633/10 to All on Tuesday, June 02, 2026 21:30:02
    David Christensen wrote:
    On 6/2/26 07:02, Stefan Monnier wrote:
    I've seen too many hard-to-diagnose problems that disappeared after
    replacing a sketchy PSU. Random hangs and crashes, hard drive errors,
    video noise, hang on reboot, etc. I no longer try to go cheap when
    choosing a PSU for builds.

    Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
    while it's now standard for chips to monitor their temperature (and
    adjust their power consumption if it gets too high), I still haven't
    seen anything comparable that would detect and report when the input
    voltage goes out-of-range (and maybe also take steps to reduce the
    instantaneous power consumption?).

    https://en.wikipedia.org/wiki/Lm_sensors

    lm_sensors (Linux-monitoring sensors) is a free open-source
    software-tool for Linux that provides tools and drivers for monitoring
    temperatures, voltage, humidity, and fans. It can also detect chassis
    intrusions.

    Almost all of those sensors are on the motherboard or on
    attached cards; you'll frequently get to see voltages from the
    CPU, but hardly ever does a power supply tell you about its
    load.

    If you plug in a UPS, most (but not all) can give you
    information about the wall outlet power and how much is
    currently being drawn by the machines attached to the UPS.

    -dsr-

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Joe@3:633/10 to All on Tuesday, June 02, 2026 21:50:01
    On Tue, 2 Jun 2026 15:06:20 -0400
    Dan Ritter <dsr@randomstring.org> wrote:

    David Christensen wrote:
    On 6/2/26 07:02, Stefan Monnier wrote:
    I've seen too many hard-to-diagnose problems that disappeared
    after replacing a sketchy PSU. Random hangs and crashes, hard
    drive errors, video noise, hang on reboot, etc. I no longer
    try to go cheap when choosing a PSU for builds.

    Yeah, I'm actually surprised the hardware hasn't caught on
    accordingly: while it's now standard for chips to monitor their
    temperature (and adjust their power consumption if it gets too
    high), I still haven't seen anything comparable that would detect
    and report when the input voltage goes out-of-range (and maybe
    also take steps to reduce the instantaneous power consumption?).

    https://en.wikipedia.org/wiki/Lm_sensors

    lm_sensors (Linux-monitoring sensors) is a free open-source
    software-tool for Linux that provides tools and drivers for
    monitoring temperatures, voltage, humidity, and fans. It can also
    detect chassis intrusions.

    Almost all of those sensors are on the motherboard or on
    attached cards; you'll frequently get to see voltages from the
    CPU, but hardly ever does a power supply tell you about its
    load.

    They are also going to be polled, and will return the voltage they find
    at polling time. What they won't tell you is what various ripple
    voltages are, both the initial rectified mains at 100/120Hz and the
    residual switching frequencies of the various step-down regulators.

    Under some combination of conditions, including temperature, the
    combinations of ripple may allow a regulator output to drop out of spec
    for a microsecond, more than enough to corrupt a signal on e.g. a SATA
    line. Further, this may only happen once a week or so.

    Increasing ripple voltage is what happens when smoothing capacitor
    electrolytes dry out, which they will eventually do with age.

    Einstein allegedly said:

    "Insanity is doing the same thing over and over again and expecting
    different results"

    He had obviously never encountered the Intermittent Fault.

    --
    Joe

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Stefan Monnier@3:633/10 to All on Wednesday, June 03, 2026 16:40:01
    They are also going to be polled, and will return the voltage they find
    at polling time. What they won't tell you is what various ripple
    voltages are, both the initial rectified mains at 100/120Hz and the
    residual switching frequencies of the various step-down regulators.

    Yeah, I think we'd need a kind of sensor that doesn't give just the
    current voltage but gives a bracket of the lowest & highest voltage seen
    since the last measurement, or one that can trigger an interrupt if the
    voltage ever goes outside of a given range.
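
    Lacking such a sensor, the closest software approximation is to poll
    an existing one quickly and keep the lowest and highest readings. The
    sketch below assumes a hwmon-style file that holds one integer per
    read (the kernel's hwmon in*_input files report millivolts); the
    function names minmax and sample_voltage are invented for
    illustration. Polling still misses anything shorter than the polling
    interval, so this narrows the blind spot rather than closing it.

```shell
#!/bin/sh
# minmax: read one integer reading per line on stdin, print "MIN MAX".
minmax() {
    awk 'NR == 1  { min = max = $1 }
         $1 < min { min = $1 }
         $1 > max { max = $1 }
         END      { if (NR) print min, max }'
}

# sample_voltage FILE COUNT INTERVAL
# Print the contents of FILE COUNT times, sleeping INTERVAL seconds between
# reads. FILE is assumed to hold a single integer reading (e.g. millivolts).
sample_voltage() {
    file=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example (the path is an assumption; check which in*_input files exist
# under /sys/class/hwmon/ on the machine in question):
#   sample_voltage /sys/class/hwmon/hwmon0/in0_input 600 0.1 | minmax
```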


    === Stefan

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Stefan Monnier@3:633/10 to All on Wednesday, June 03, 2026 16:40:01
    Yeah, I'm actually surprised the hardware hasn't caught on accordingly:
    while it's now standard for chips to monitor their temperature (and
    adjust their power consumption if it gets too high), I still haven't
    seen anything comparable that would detect and report when the input
    voltage goes out-of-range (and maybe also take steps to reduce the
    instantaneous power consumption?).
    Instead, we're in the dark, forced to try to avoid the problem
    by over-provisioning the power supply and pray that it was enough.

    lm_sensors (Linux-monitoring sensors) is a free open-source software-tool
    for Linux that provides tools and drivers for monitoring temperatures,
    voltage, humidity, and fans. It can also detect chassis intrusions.

    Two problems:

    - AFAIK there is no standard system to actively use those voltage
      sensors to detect potentially harmful situations and take measures
      (not even logging weird voltage events seems to be standard).
    - My understanding is that the problems related to power supplies which
      we'd like to catch are related to *very* transient variations on input
      voltage, and AFAIK the infrastructure around those sensors just isn't
      equipped to detect such variations.


    === Stefan

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Thursday, June 04, 2026 20:40:01
    On 6/3/26 19:12, Eben King wrote:
    On 6/3/26 10:36, Stefan Monnier wrote:
    They are also going to be polled, and will return the voltage they find
    at polling time. What they won't tell you is what various ripple
    voltages are, both the initial rectified mains at 100/120Hz and the
    residual switching frequencies of the various step-down regulators.

    Yeah, I think we'd need a kind of sensor that doesn't give just the
    current voltage but gives a bracket of the lowest & highest voltage seen
    since the last measurement, or one that can trigger an interrupt if the
    voltage ever goes outside of a given range.

    I want something that does a Fourier transform of the voltage data so I
    can see if there's ripple at any particular frequency. A live waterfall
    plot would be wonderful.

    https://www.tek.com/en/products/oscilloscopes/dpo70000sx


    David

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Joe@3:633/10 to All on Friday, June 05, 2026 19:30:01
    On Fri, 5 Jun 2026 11:54:00 -0400
    duh <fill_in_the_blanks@email.com> wrote:

    On 6/2/26 15:47, Joe wrote:

    Einstein allegedly said:

    "Insanity is doing the same thing over and over again and
    expecting different results"

    He had obviously never encountered the Intermittent Fault.

    I may be a little slow upstairs, but how does this differ from
    persistence??????? Just asking! :-)



    He seems to be implying that the universe is completely deterministic,
    whereas at the scale which humans can deal with, it isn't. He was
    implying that persistence in trying to locate an intermittent fault was
    akin to insanity.

    Which was particularly silly because he was also alleged to have said
    'God does not play dice', showing that he was aware of the existence of
    a device invented for the specific purpose of obtaining different
    results while doing the same (at a human level) thing.

    --
    Joe

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charles Curley@3:633/10 to All on Sunday, June 21, 2026 23:10:01
    On Fri, 22 May 2026 09:53:17 -0600
    Charles Curley <charlescurley@charlescurley.com> wrote:

    I have four four terabyte hard drives. Each has a partition on it. The
    four partitions comprise a RAID 5 array using mdadm. On top of that,
    LUKS encryption, then LVM with ext4 logical volumes.

    I believe I have found a solution to this problem. I installed the
    backports kernel. Since then I have run more than four hours solid of
    tests and not found a single error.

    I did replace one hard drive. While that resulted in a quieter office,
    it did not solve the problem.

    Checking voltages from the power supply and the wall with a digital
    volt meter did not show any out of spec problems.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.17
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Monday, June 22, 2026 09:40:01
    On Fri, 22 May 2026 09:53:17 -0600 Charles Curley wrote:
    I have four four terabyte hard drives. Each has a partition on it. The
    four partitions comprise a RAID 5 array using mdadm. On top of that,
    LUKS encryption, then LVM with ext4 logical volumes.

    On one LVM partition I have a number of backup files, tarred,
    bzipped, and sha256 and sha512 summed. I have a script which will find
    checksum files, and execute the appropriate program to test the
    archives. It puts each program into the background, parallelizing any
    number of checksum tests.


    On 6/21/26 14:05, Charles Curley wrote:
    I believe I have found a solution to this problem. I installed the
    backports kernel. Since then I have run more than four hours solid of
    tests and not found a single error.

    I did replace one hard drive. While that resulted in a quieter office,
    it did not solve the problem.

    Checking voltages from the power supply and the wall with a digital
    volt meter did not show any out of spec problems.


    I am glad that your storage is working correctly now.


    Please run and post the following commands with both the previous kernel
    and the backports kernel:

    $ cat /etc/debian_version

    $ uname -a


    I will assume your script spawns a separate, isolated process for each
    checksum file.


    If you have ruled out the power supply, memory, and disks, another
    possibility could be a race condition in the kernel and/or I/O stack
    that is triggered when multiple processes access storage in parallel.


    Are the checksum errors repeatable on another computer with a similar
    storage architecture and the previous kernel? If so, do they disappear
    with the backports kernel?


    David

    --- PyGate Linux v1.5.17
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From Charles Curley@3:633/10 to All on Monday, June 22, 2026 19:00:01
    On Mon, 22 Jun 2026 00:32:19 -0700
    David Christensen <dpchrist@holgerdanske.com> wrote:

    I am glad that your storage is working correctly now.

    Thank you.



    Please run and post the following commands with both the previous
    kernel and the backports kernel:

    $ cat /etc/debian_version

    $ uname -a

    Backport

    root@hawk:~# cat /etc/debian_version
    13.5
    root@hawk:~# cat /etc/os-release
    PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
    NAME="Debian GNU/Linux"
    VERSION_ID="13"
    VERSION="13 (trixie)"
    VERSION_CODENAME=trixie
    DEBIAN_VERSION_FULL=13.5
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"
    root@hawk:~# uname -s
    Linux
    root@hawk:~# uname -a
    Linux hawk 7.0.10+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 7.0.10-1~bpo13+1 (2026-05-28) x86_64 GNU/Linux
    root@hawk:~#

    Prior kernel is the same except the kernel is

    linux-image-6.12.90+deb13.1-amd64 6.12.90-2 amd64

    I think that this problem first showed up late April or early May. If
    that is correct, the following extracts from my apt logs might help.
    These were copied and pasted unwrapped, so unless they were mangled in
    transit, long lines should be preserved.

    Start-Date: 2026-04-13 12:46:25
    Commandline: apt install -t trixie-backports linux-image-amd64
    Install: linux-image-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-modules-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-base-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-binary-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic), linux-base-6.19.10+deb13-amd64:amd64 (6.19.10-1~bpo13+1, automatic)
    Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.19.10-1~bpo13+1), linux-libc-dev:amd64 (6.12.74-2, 6.19.10-1~bpo13+1)
    End-Date: 2026-04-13 12:46:52

    Start-Date: 2026-04-16 08:45:37
    Commandline: apt remove linux-image-amd64
    Remove: linux-image-amd64:amd64 (6.19.10-1~bpo13+1)
    End-Date: 2026-04-16 08:45:39

    Start-Date: 2026-04-16 08:45:47
    Commandline: apt install linux-image-amd64
    Install: linux-image-amd64:amd64 (6.12.74-2)
    End-Date: 2026-04-16 08:45:49

    Start-Date: 2026-04-18 11:27:13
    Commandline: apt install -t trixie-backports linux-image-amd64
    Install: linux-image-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic), linux-modules-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic), linux-binary-6.19.11+deb13-amd64:amd64 (6.19.11-1~bpo13+1, automatic)
    Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.19.11-1~bpo13+1)
    End-Date: 2026-04-18 11:27:44

    Start-Date: 2026-04-21 10:35:49
    Commandline: apt purge linux-image-amd64
    Purge: linux-image-amd64:amd64 (6.19.11-1~bpo13+1)
    End-Date: 2026-04-21 10:35:52

    Start-Date: 2026-04-21 10:36:06
    Commandline: apt install linux-image-amd64
    Install: linux-image-amd64:amd64 (6.12.74-2)
    End-Date: 2026-04-21 10:36:08


    Start-Date: 2026-05-01 04:40:22
    Commandline: /usr/bin/unattended-upgrade
    Install: linux-image-6.12.85+deb13-amd64:amd64 (6.12.85-1, automatic)
    Upgrade: linux-image-amd64:amd64 (6.12.74-2, 6.12.85-1)
    End-Date: 2026-05-01 04:40:45

    Start-Date: 2026-05-09 04:49:58
    Commandline: /usr/bin/unattended-upgrade
    Install: linux-image-6.12.86+deb13-amd64:amd64 (6.12.86-1, automatic)

    Start-Date: 2026-05-16 04:01:47
    Commandline: /usr/bin/unattended-upgrade
    Install: linux-image-6.12.88+deb13-amd64:amd64 (6.12.88-1, automatic)

    Start-Date: 2026-05-24 11:54:07
    Commandline: /usr/bin/unattended-upgrade
    Install: linux-image-6.12.90+deb13-amd64:amd64 (6.12.90-1, automatic)

    Start-Date: 2026-05-29 04:08:32
    Commandline: /usr/bin/unattended-upgrade
    Install: linux-image-6.12.90+deb13.1-amd64:amd64 (6.12.90-2, automatic)

    Start-Date: 2026-06-20 11:19:40
    Commandline: apt install -t trixie-backports linux-image-amd64

    Apparently, twice in April I had tried the backports kernel(s), and
    found them unsatisfactory. So possibly the fix came in between
    linux-image-6.19.11+deb13-amd64 and the current backports kernel,
    linux-image-7.0.10+deb13-amd64. Quite possibly a major version number
    change might even inadvertently fix something as subtle as this.
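
    A timeline like the one above can be pulled out of the apt history
    log mechanically. The sketch below is one way to do it, not how the
    extracts above were produced: it reads apt history records on stdin
    and prints the start date of each transaction next to every versioned
    linux-image package it installed or upgraded. The function name
    kernel_timeline is invented for illustration.

```shell
#!/bin/sh
# kernel_timeline: read apt history.log records on stdin and print
# "DATE TIME PACKAGE" for each versioned linux-image package that appears
# on an Install: or Upgrade: line. The linux-image-amd64 metapackage has
# no digit after the prefix, so it is deliberately not matched.
kernel_timeline() {
    awk '
        /^Start-Date:/ { date = $2 " " $3 }
        /^(Install|Upgrade):/ {
            line = $0
            while (match(line, /linux-image-[0-9][^: ]*/)) {
                print date, substr(line, RSTART, RLENGTH)
                line = substr(line, RSTART + RLENGTH)
            }
        }'
}

# Example, using the usual Debian log location:
#   kernel_timeline < /var/log/apt/history.log
# Rotated logs are gzipped and can be fed through zcat first.
```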



    I will assume your script spawns a separate, isolated process for
    each checksum file.

    Correct. It does a find on *.sha256sums, *.sha512sums, and several other
    suffixes. It then slices and dices to figure out the appropriate
    program to call, sha256sum and sha512sum, respectively.

    The files are created using paths relative to the current directory, so
    when the script runs, it will pushd to that directory.

    The key line is

    nice "${prog}" "${opts}" -c "${file}" &

    Where $prog is the result of the slicing and dicing, opts='--quiet',
    and $file the file to be scanned.

    I recently added the option to limit the number of background tasks to
    the number of processors (nproc --all). That reduces the number of
    errors but does not eliminate them.
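
    For readers who want to try something similar, here is a minimal
    sketch of the approach described above. It is not the actual script:
    the function names verify_one and verify_tree are invented, only the
    two suffixes named above are handled, and the job limit is enforced
    by waiting for a whole batch to finish, which is simpler but less
    efficient than refilling slots one at a time.

```shell
#!/bin/sh
# verify_one SUMFILE
# Pick the checker from the suffix, change to the checksum file's directory
# (entries inside it are relative paths) and verify. Runs in a subshell so
# the cd does not leak. Checker output goes to stderr.
verify_one() (
    sumfile=$1
    case $sumfile in
        *.sha256sums) prog=sha256sum ;;
        *.sha512sums) prog=sha512sum ;;
        *) echo "unknown suffix: $sumfile" >&2; exit 2 ;;
    esac
    cd "$(dirname "$sumfile")" || exit 2
    nice "$prog" --quiet -c "$(basename "$sumfile")" >&2
)

# verify_tree DIR [MAX_JOBS]
# Verify every checksum file under DIR, at most MAX_JOBS at a time
# (default: nproc --all). Prints the number of checksum files that failed.
verify_tree() {
    root=$1
    max_jobs=${2:-$(nproc --all)}
    find "$root" -type f \( -name '*.sha256sums' -o -name '*.sha512sums' \) |
    {
        pids=''
        running=0
        failed=0
        while IFS= read -r sumfile; do
            verify_one "$sumfile" &
            pids="$pids $!"
            running=$((running + 1))
            if [ "$running" -ge "$max_jobs" ]; then
                # Batch is full: wait for all of it before starting more.
                for pid in $pids; do
                    wait "$pid" || failed=$((failed + 1))
                done
                pids=''
                running=0
            fi
        done
        for pid in $pids; do
            wait "$pid" || failed=$((failed + 1))
        done
        echo "$failed"
    }
}
```

    Running verify_tree with MAX_JOBS set to 1 gives a serial baseline to
    compare against the parallel runs.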



    If you have ruled out the power supply, memory, and disks, another
    possibility could be a race condition in the kernel and/or I/O stack
    that is triggered when multiple processes access storage in parallel.

    I wouldn't say those are all ruled out. But the fact that the
    backport kernel is a major version number change, and that it appears
    to have solved the problem is highly suggestive. With that caveat, I
    concur. That is definitely something for the kernel folks to look at.



    Are the checksum errors repeatable on another computer with a similar
    storage architecture and the previous kernel? If so, do they
    disappear with the backports kernel?

    I have a much more recent and much faster computer, peregrine, with
    NVMe storage and 12 cores. Hawk has eight cores, and spinning rust.
    Hawk is where the problem has shown up. Peregrine does not show the
    problem.

    For 18 gig of data, hawk: 2m18.318s, peregrine 0m24.233s.




    David



    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.17
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Tuesday, June 23, 2026 00:50:02
    On 6/22/26 09:54, Charles Curley wrote:
    On Mon, 22 Jun 2026 00:32:19 -0700 David Christensen wrote:
    Please run and post the following commands with both the previous
    kernel and the backports kernel:

    $ cat /etc/debian_version

    $ uname -a

    Backport

    root@hawk:~# cat /etc/debian_version
    13.5
    root@hawk:~# cat /etc/os-release
    PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
    NAME="Debian GNU/Linux"
    VERSION_ID="13"
    VERSION="13 (trixie)"
    VERSION_CODENAME=trixie
    DEBIAN_VERSION_FULL=13.5
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"
    root@hawk:~# uname -s
    Linux
    root@hawk:~# uname -a
    Linux hawk 7.0.10+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 7.0.10-1~bpo13+1 (2026-05-28) x86_64 GNU/Linux
    root@hawk:~#

    Prior kernel is the same except the kernel is

    linux-image-6.12.90+deb13.1-amd64 6.12.90-2 amd64


    Good information.


    I think that this problem first showed up late April or early May. If
    that is correct, the following extracts from my apt logs might help.
    These were copied and pasted unwrapped, so unless they were mangled in
    transit, long lines should be preserved.

    ...

    Apparently, twice in April I had tried the backports kernel(s), and
    found them unsatisfactory. So possibly the fix came in between
    linux-image-6.19.11+deb13-amd64 and the current backports kernel,
    linux-image-7.0.10+deb13-amd64.


    AIUI if someone can come up with a shell script or program whose exit
    value reliably indicates the presence or absence of a bug, Git can do a
    binary search over a range of commits and locate the commit where the
    bug originated.


    Quite possibly a major version number
    might even inadvertently fix something as subtle as this.


    Agreed.


    I will assume your script spawns a separate, isolated process for
    each checksum file.

    Correct. It does a find on *.sha256sums, *.sha512sums, and several other
    suffixes. It then slices and dices to figure out the appropriate
    program to call, sha256sum and sha512sum, respectively.

    The files are created using paths relative to the current directory, so
    when the script runs, it will pushd to that directory

    The key line is

    nice "${prog}" "${opts}" -c "${file}" &

    Where $prog is the result of the slicing and dicing, opts='--quiet',
    and $file the file to be scanned.


    I use a similar workflow for image files and also wrote a script to
    generate and verify checksum files.


    I recently added the option to limit the number of background tasks to
    the number of processors (nproc --all). That reduces but does not
    eliminate the number of errors.


    That is another clue that there is a race condition related to parallel I/O.
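
    One way to probe that idea without any stored checksum files is to
    hash the same file several times at once and see whether the digests
    agree. The sketch below does that; the function name distinct_digests
    is invented for illustration. A caveat: once a file is in the page
    cache, later reads may never reach the disks, so a meaningful run
    needs a file larger than free RAM, or the caches dropped first (as
    root, via /proc/sys/vm/drop_caches).

```shell
#!/bin/sh
# distinct_digests FILE N
# Hash FILE N times concurrently and print how many different digests came
# back. On healthy hardware and software the answer is always 1.
distinct_digests() {
    file=$1
    n=$2
    tmp=$(mktemp -d) || return 2
    i=0
    while [ "$i" -lt "$n" ]; do
        # Reading via stdin keeps the file name out of the output, so the
        # result lines differ only if the digests differ.
        sha256sum < "$file" > "$tmp/$i" &
        i=$((i + 1))
    done
    wait
    cat "$tmp"/* | sort -u | wc -l | tr -d ' '
    rm -rf "$tmp"
}

# Example (the path is a placeholder):
#   distinct_digests /path/to/large/archive.tar.bz2 8
```

    Any result other than 1 means two simultaneous reads of the same
    bytes disagreed, which points below the filesystem contents.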


    If you have ruled out the power supply, memory, and disks, another
    possibility could be a race condition in the kernel and/or I/O stack
    that is triggered when multiple processes access storage in parallel.

    I wouldn't say those are all ruled out. But the fact that the
    backport kernel is a major version number change, and that it appears
    to have solved the problem is highly suggestive. With that caveat, I
    concur. That is definitely something for the kernel folks to look at.


    Agreed.


    Are the checksum errors repeatable on another computer with a similar
    storage architecture and the previous kernel? If so, do they
    disappear with the backports kernel?

    I have a much more recent and much faster computer, peregrine, with
    NVMe storage and 12 cores. Hawk has eight cores, and spinning rust.
    Hawk is where the problem has shown up. Peregrine does not show the
    problem.

    For 18 gig of data, hawk: 2m18.318s, peregrine 0m24.233s.


    That clue makes me think the race condition is in the SCSI stack.


    It looks like there is a newer kernel for Trixie. It is best to file
    bug reports against current packages. Can you test it?

    https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64


    David

    --- PyGate Linux v1.5.17
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From The Wanderer@3:633/10 to All on Tuesday, June 23, 2026 04:20:01
    On 2026-06-22 at 18:40, David Christensen wrote:
    On 6/22/26 09:54, Charles Curley wrote:
    I think that this problem first showed up late April or early May.
    If that is correct, the following extracts from my apt logs might
    help. These were copied and pasted unwrapped, so unless they were
    mangled in transit, long lines should be preserved.

    ...

    Apparently, twice in April I had tried the backports kernel(s),
    and found them unsatisfactory. So possibly the fix came in between
    linux-image-6.19.11+deb13-amd64 and the current backports kernel,
    linux-image-7.0.10+deb13-amd64.

    AIUI if someone can come up with a shell script or program whose exit
    value reliably indicates the presence or absence of a bug, Git can
    do a binary search over a range of commits and locate the commit
    where the bug originated.

    ...assuming that there aren't any commits broken for other reasons, or
    otherwise commits where the codebase can't be built far enough for the
    script or program to be able to do its thing, that will be hit along
    the way.
    That could be folded in under "reliably", of course - but it's
    sufficiently far out of the scope of what people might be expected to
    think of for that term, if not previously familiar with what such
    bisections can involve, that it seems worth calling out explicitly.
    That said, git can actually do this even *without* such a
    script/program, as long as you're willing and able to test each
    candidate commit manually. The use of a script or program to automate it
    is actually a subset of the functionality of the 'git bisect'
    sub-command, specifically the sub-sub-command 'git bisect run'; see 'git
    help bisect' for the documentation, most of which is about the version
    of the process where the validation of each commit is done manually
    rather than by a script.
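
    As a rough sketch of what such a script could look like: 'git bisect
    run' treats exit code 0 as good, 125 as "skip this commit", and any
    other code from 1 to 127 as bad. The function name bisect_verdict and
    the file name bisect_check.py are hypothetical, and the build and test
    commands are placeholders supplied by the caller. For a kernel bug each
    candidate also has to be booted before it can be tested, which this
    sketch does not attempt, and an intermittent failure would need the
    test command to repeat the check several times.

```python
import shlex
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes that 'git bisect run' understands


def bisect_verdict(build_cmd, test_cmd):
    """Map a build step and a test step onto a 'git bisect run' exit code."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # commit cannot be built, so it cannot be judged
    return GOOD if subprocess.run(test_cmd).returncode == 0 else BAD


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   git bisect run python3 bisect_check.py "<build command>" "<test command>"
    sys.exit(bisect_verdict(shlex.split(sys.argv[1]), shlex.split(sys.argv[2])))
```

    Returning 125 for unbuildable commits covers the caveat above about
    commits that are broken for unrelated reasons.
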
    --
    The Wanderer
    The reasonable man adapts himself to the world; the unreasonable one
    persists in trying to adapt the world to himself. Therefore all
    progress depends on the unreasonable man. -- George Bernard Shaw


  • From Charles Curley@3:633/10 to All on Friday, June 26, 2026 00:50:01
    On Mon, 22 Jun 2026 15:40:11 -0700
    David Christensen <dpchrist@holgerdanske.com> wrote:

    It looks like there is a newer kernel for Trixie. It is best to file
    bug reports against current packages. Can you test it?

    https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64

    Tested. It came up with all sorts of fails.

    charles@hawk:~$ uname -a
    Linux hawk 6.12.94+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.94-1 (2026-06-20) x86_64 GNU/Linux
    charles@hawk:~$

    I gather you are suggesting I file a bug against this kernel,
    linux-image-6.12.94+deb13-amd64 6.12.94-1.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

    --- PyGate Linux v1.5.18
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From David Christensen@3:633/10 to All on Friday, June 26, 2026 03:10:02
    On 6/25/26 15:40, Charles Curley wrote:
    On Mon, 22 Jun 2026 15:40:11 -0700
    David Christensen <dpchrist@holgerdanske.com> wrote:

    It looks like there is a newer kernel for Trixie. It is best to file
    bug reports against current packages. Can you test it?

    https://packages.debian.org/stable/kernel/linux-image-6.12.94+deb13-amd64

    Tested. It came up with all sorts of fails.

    charles@hawk:~$ uname -a
    Linux hawk 6.12.94+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.94-1 (2026-06-20) x86_64 GNU/Linux
    charles@hawk:~$

    I gather you are suggesting I file a bug against this kernel,
    linux-image-6.12.94+deb13-amd64 6.12.94-1.


    That would make sense, especially since Linux 6.12 appears to be the
    kernel for Debian Stable (Trixie):

    https://packages.debian.org/trixie/all/allpackages


    David

  • From Charles Curley@3:633/10 to All on Friday, June 26, 2026 05:50:02
    On Thu, 25 Jun 2026 18:07:21 -0700
    David Christensen <dpchrist@holgerdanske.com> wrote:

    That would make sense, especially since Linux 6.12 appears to be the
    kernel for Debian Stable (Trixie):

    https://packages.debian.org/trixie/all/allpackages

    OK, will do.

    However, things are back up in the air. I rebooted to 7.0.10, and ran
    some backups. The prior testing has been all reading: checksum
    verification. Nothing on the disk actually changed. This backs up from the
    SSD to the RAID array. I had several directories fail with the error
    message "failed: Bad message (74)", whatever that means.
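
    The number in that message is an errno value, and it can be decoded
    with Python's standard errno and os modules. The helper name
    describe_errno is made up for this sketch. On Linux, 74 is expected to
    map to EBADMSG; ext4 is one filesystem that can return that code when
    an on-disk checksum fails to verify, though nothing in this thread
    establishes that as the cause here.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message text) for an errno value on this platform."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


if __name__ == "__main__":
    # Errno numbering is platform specific; 74 is the value reported on hawk.
    print(describe_errno(74))
```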

    I immediately fscked the logical volume. No errors. I then diffed the
    directories against the originals. There were four instances of files
    found in the backups and not in the originals. Otherwise the "failed"
    directories were intact and duplicated the originals. I conjecture that
    rsync attempted to delete them and failed to do so. I conjecture that
    the next pass will get them.
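
    A comparison like the one described above, listing what exists only on
    the backup side, could be scripted with the standard filecmp module.
    This is a sketch and not the diff actually run; extra_in_backup is a
    hypothetical name. filecmp.dircmp skips a few names by default (for
    example CVS and RCS), and a directory found only in the backup is
    listed by name without being descended into.

```python
import filecmp


def extra_in_backup(original, backup):
    """Return sorted relative paths present under backup but missing under original."""
    found = []

    def walk(cmp, prefix):
        # right_only holds names that exist only in the second (backup) tree.
        found.extend(prefix + name for name in cmp.right_only)
        # subdirs maps each directory common to both trees to its own dircmp.
        for sub, child in cmp.subdirs.items():
            walk(child, prefix + sub + "/")

    walk(filecmp.dircmp(original, backup), "")
    return sorted(found)
```

    An empty result would mean the backup holds nothing that the original
    lacks, which is the state the next rsync pass should restore if the
    conjecture is right.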

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From David Christensen@3:633/10 to All on Saturday, June 27, 2026 01:50:02
    On 6/25/26 20:44, Charles Curley wrote:
    On Thu, 25 Jun 2026 18:07:21 -0700 David Christensen wrote:
    That would make sense, especially since Linux 6.12 appears to be the
    kernel for Debian Stable (Trixie):

    https://packages.debian.org/trixie/all/allpackages

    OK, will do.


    Thank you.


    However, things are back up in the air. I rebooted to 7.0.10, and ran
    some backups. The prior testing has been all reading: checksum
    verification. Nothing on the disk actually changed. This backs up from the
    SSD to the RAID array. I had several directories fail with the error
    message "failed: Bad message (74)", whatever that means.

    I immediately fscked the logical volume. No errors. I then diffed the
    directories against the originals. There were four instances of files
    found in the backups and not in the originals. Otherwise the "failed"
    directories were intact and duplicated the originals. I conjecture that
    rsync attempted to delete them and failed to do so. I conjecture that
    the next pass will get them.


    Are you booting the OS disk and running backups, or are you booting live
    media and backing up?


    Please post a console session that shows prompts, commands entered, and
    output displayed. If you are using shell scripts, please enable the
    "xtrace" option.


    David
