Forum: Jacob's Hideout BBS

How to salvage a degraded mdadm RAID1 with as little data loss as possible?

From Paul Leiber@3:633/10 to All on Saturday, June 20, 2026 18:10:02

Subject: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Hi everybody,

I am using an mdadm software RAID1 as a dedicated MariaDB database file system. The devices used for the RAID1 are two partitions of identical size which are LUKS encrypted. The devices are decrypted via entries in /etc/crypttab. The resulting RAID1 is called /dev/md0, formatted as XFS. (For completeness' sake: md0 is then forwarded to a database VM which stores the database on the device, but that shouldn't play a role for my questions, IIUC.)

Some time ago, I noticed that the database content changed after a reboot. Recent changes to the databases were seemingly lost. I couldn't pinpoint the cause for this, but attributed it to an unclean shutdown of the database prior to reboot of the database VM. Data loss in a database of course is not ideal, so I kept on looking. It seems that I have now identified the root cause for the data loss in the RAID1.

I checked the RAID1:

root@xxx:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 dm-30[2]
      1073593280 blocks super 1.2 [2/1] [_U]
      bitmap: 7/8 pages [28KB], 65536KB chunk

The [_U] seems to indicate that the RAID1 is currently degraded and that only one of the two partitions is currently in use.
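A minimal sketch for spotting this state automatically, assuming the usual /proc/mdstat layout shown above. The function name degraded_arrays is made up for this example; it prints every array whose member map (such as [_U]) contains an underscore.

```shell
# Print the name of each md array that has at least one missing member.
# Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }              # "md0 : active raid1 ..." starts an array
        name != "" && match($0, /\[[U_]+\]/) {  # member map such as [UU] or [_U]
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
            name = ""
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

Run from cron or a login script, a non-empty result would be a hint that an array has been running on a single member.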

Checking the partitions the RAID1 is based on gives the following output:

root@xxx:~# mdadm --examine /dev/dm-31
/dev/dm-31:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6834a10d:edb03a51:cef24158:f9abc812
           Name : xxx:0  (local to host xxx)
  Creation Time : Fri Nov  4 16:05:45 2022
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 2147186655 sectors (1023.86 GiB 1099.36 GB)
     Array Size : 1073593280 KiB (1023.86 GiB 1099.36 GB)
  Used Dev Size : 2147186560 sectors (1023.86 GiB 1099.36 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=95 sectors
          State : clean
    Device UUID : 01c96166:ee782cc7:57bcf889:2ee53b43

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun 17 13:17:46 2026
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : d46fa108 - correct
         Events : 5397997

   Device Role : Active device 0
   Array State : A. ('A' == active, '.' == missing, 'R' == replacing)

and

root@xxx:~# mdadm --examine /dev/dm-30
/dev/dm-30:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6834a10d:edb03a51:cef24158:f9abc812
           Name : xxx:0  (local to host xxx)
  Creation Time : Fri Nov  4 16:05:45 2022
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 2147186655 sectors (1023.86 GiB 1099.36 GB)
     Array Size : 1073593280 KiB (1023.86 GiB 1099.36 GB)
  Used Dev Size : 2147186560 sectors (1023.86 GiB 1099.36 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=95 sectors
          State : clean
    Device UUID : 637fc155:8fb21b7c:fff27b71:c7ea1094

Internal Bitmap : 8 sectors from superblock
    Update Time : Fri Jun 19 21:34:19 2026
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : b110fe9d - correct
         Events : 4814810

   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing, 'R' == replacing)

It can be seen that update time and number of events differ between both partitions, which seems to indicate different data. I am assuming that due to some circumstance (wild guess: a race condition when unlocking the LUKS encryption), the RAID1 is more or less randomly using only one of the partitions, which then results in differing database versions, depending on which of the two partitions is currently used.
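To compare the two members side by side, something like the following could be used. It is only a sketch: examine_field is a hypothetical helper, and it assumes the "Key : value" layout of the mdadm --examine output shown above. Note that in that output the member with the higher event count (dm-31) is not the one with the later update time (dm-30), so neither number alone says which copy holds the wanted data.

```shell
# Print the value of one "Key : value" field from `mdadm --examine` output
# read on stdin, e.g. "Events" or "Update Time".
examine_field() {
    awk -v key="$1" -F ' : ' '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }   # trim padding around the key
        k == key { print substr($0, index($0, " : ") + 3); exit }
    '
}

# Usage on a live system (device names as in the post; they may change
# between boots, so check them first):
#   for dev in /dev/dm-30 /dev/dm-31; do
#       printf '%s events=%s updated=%s\n' "$dev" \
#           "$(mdadm --examine "$dev" | examine_field Events)" \
#           "$(mdadm --examine "$dev" | examine_field 'Update Time')"
#   done
```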

I also think that I found a possible cause for this misbehaviour. My /etc/mdadm/mdadm.conf contains just the default settings:

# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays

# This configuration was auto-generated on Fri, 04 Nov 2022 15:52:55 +0100 by mkconf

Somehow, I failed to add the RAID1 information for md0 to the configuration file (e. g. by entering root@localhost:~# mdadm --detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure whether this actually is the cause and whether adding that information would solve the issue.
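For reference, a hedged sketch of how the ARRAY definition could be recorded, following the note in the default file that update-initramfs -u must be run afterwards. The function name record_arrays and the DRY_RUN switch are inventions of this example. By default it only prints the commands, and whether the missing ARRAY line is really the cause remains an open question.

```shell
# Append ARRAY definitions for all running arrays to mdadm.conf and rebuild
# the initramfs. With DRY_RUN=1 (the default here) the commands are only printed.
record_arrays() {
    conf=${1:-/etc/mdadm/mdadm.conf}
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mdadm --detail --scan >> $conf"
        echo "update-initramfs -u"
    else
        cp -a "$conf" "$conf.bak" &&           # keep a copy of the old file
        mdadm --detail --scan >> "$conf" &&    # appends; avoid running it twice
        update-initramfs -u
    fi
}

record_arrays
```

Running it with DRY_RUN=0 as root would perform the actual change; checking the appended line by hand before rebooting seems prudent.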

My questions are the following:

1. Is my analysis valid in principle? Especially: Could the root cause for this issue be that mdadm.conf is missing the information for md0, and could adding that information prevent data loss or inconsistencies in the future?
2. Can I (re)create the RAID1 md0 or (re-)add the missing partition in an easy way that no or at least not all information is lost? If yes, how?

I assume that it might not be possible to sync the data from two different database versions without data loss. If this assumption is correct, I am willing to use one data set (e. g. the one on dm-31) and discard the other data set (e. g. the one on dm-30). Guides I found so far describe how to set up a new RAID1 and copy the data from a partition to the new RAID1. However, I am wondering whether it is possible to (re-)create a RAID1 using just one existing partition (e. g. dm-31) without losing the data on this partition, and then add the other partition to the RAID1.

The databases are backed up regularly. However, the backup is incremental, and it seems that the different database versions are messing up the incremental backup, therefore my last valid backup doesn't include the most recent changes to the database. If it is not possible to salvage the data on one or both of the partitions, I could swallow the bitter pill and go back to a previous database state without unacceptable consequences. However, I would like to try to salvage as much data as possible.

Thank you in advance,

Paul

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Saturday, June 20, 2026 20:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Sat, 20 Jun 2026 18:01:15 +0200
Paul Leiber <paul@onlineschubla.de> wrote:

Somehow, I failed to add the RAID1 information for md0 to the configuration file (e. g. by entering root@localhost:~# mdadm
--detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure if
this actually is the cause and adding that information would solve
the issue.
My questions are the following:

1. Is my analysis valid in principle? Especially: Could the root
cause for this issue be that mdadm.conf is missing the information
for md0, and could adding that information prevent data loss or inconsistencies in the future?

I doubt that this is the culprit. The man page for mdadm says, in part:

Assemble
Assemble the components of a previously created array
into an active array. Components can be explicitly given or can
be searched for. mdadm checks that the components do form a
bona fide array, and can, on request, fiddle superblock
information so as to assemble a faulty array.

So mdadm *should* find both devices. But it might not. And adding
that line will not hurt. I have a similar line in my mdadm.conf.

I built my RAID array up a bit differently than you did yours. You made
your partitions, put LUKS on the partitions, then the RAID on top of
that. I have the partitions, then the RAID array, LUKS on top of that,
then LVM, with file systems on top of the LVs. But I know of no reason
your setup shouldn't work.

I have found that when I have multiple LUKS partitions, giving them
all the same passphrase means I need give only one passphrase to
decrypt on boot.

2. Can I (re)create the RAID1 md0 or (re-)add the missing partition
in an easy way that no or at least not all information is lost? If
yes, how?

Yes. For the gory details see https://oneuptime.com/blog/post/2026-03-02-how-to-replace-a-failed-disk-in-mdadm-raid-on-ubuntu/view.

In short,

* Fail the offending disk. It looks like this has already happened, but
it shouldn't hurt to do it again.

* Remove the disk from the array.

* Add the disk back in again. This should trigger rebuilding, which
takes a while. During the rebuild, the data should be both readable
and writable. You may monitor with:

cat /proc/mdstat
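The three steps above might look roughly like this. It is a sketch, not a tested procedure: readd_member, run and DRY_RUN are names made up here, the device path is a placeholder, and by default the commands are only printed. Keep in mind that the rebuild overwrites the re-added member with the contents of the member that stayed in the array.

```shell
# Print the command (DRY_RUN=1, default) or execute it (DRY_RUN=0).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# $1 = array, $2 = member to drop and add back (it will be overwritten).
readd_member() {
    array=$1; member=$2
    run mdadm "$array" --fail "$member" || true     # may fail if already out of the array
    run mdadm "$array" --remove "$member" || true   # likewise
    run mdadm "$array" --add "$member"              # starts the rebuild
}

readd_member /dev/md0 /dev/mapper/MEMBER_TO_OVERWRITE
```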

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

From David Christensen@3:633/10 to All on Saturday, June 20, 2026 21:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

I will interpret the above as MariaDB is storing data on files on an XFS
file system on an mdadm RAID1 block device built from two LUKS
containers on two partitions of the same size on two hard disk drives.

AIUI, mdadm RAID protects you when the disk controller is unable to read
a block on one disk. When that happens, mdadm will read other disk(s)
in the array, compute the requested block, and return the requested information to the calling application. (I assume mdadm will also write
the computed block back to the original disk, write a log entry, and
take other actions as designed and configured.)

AIUI neither XFS nor mdadm compute, store, or verify checksums of data
or metadata on disk. So, if a bit, byte, block, etc., changes on disk unexpectedly, neither XFS nor mdadm will know; they will simply use the information on disk. Whatever is looking at that information (e.g.
MariaDB or XFS) may or may not notice the corruption. The user may or
may not notice the corruption.

To protect against corruption in memory, you need error correction code memory. This requires hardware support on the motherboard and memory
modules.

To protect against data corruption on storage, you need a checksumming
storage system. AIUI btrfs and ZFS are the obvious choices on Debian GNU/Linux. Unfortunately, ZFS is not supported OOTB due to licensing conflicts; you must install ZFS. If you choose to do so, it is wise to
also install ZFS on your maintenance/rescue media.

Over the past several years, I migrated my file storage services and CVS repository from a desktop computer with non-ECC memory, Debian
GNU/Linux, SATA HDD's, LUKS, mdadm RAID1, and ext4 to an entry-level
server with ECC memory, FreeBSD, SAS/SATA HBA's, new SAS HDD's, new SAS
and SATA cables, GELI, and ZFS mirror. The cost was moderate, the
learning curve was non-trivial, and I caused some non-critical data loss
along the way, but now everything is accurate and reliable. I suggest
that you migrate your MariaDB storage similarly.

David

From debian-user@3:633/10 to All on Saturday, June 20, 2026 22:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

David Christensen <dpchrist@holgerdanske.com> wrote:

AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.

XFS has checksummed metadata for many years now (12?), but it doesn't checksum
user data.
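Whether a particular XFS file system has metadata checksums enabled can be read from the crc= value that xfs_info reports. A small sketch, assuming xfs_info-style text on stdin; the helper name xfs_has_crc is hypothetical.

```shell
# Succeed if xfs_info-style text on stdin reports crc=1 (metadata checksums on).
xfs_has_crc() {
    grep -Eq '(^|[^[:alnum:]_])crc=1([^0-9]|$)'
}

# Usage on a live system:
#   xfs_info /path/to/mountpoint | xfs_has_crc && echo "metadata checksums enabled"
```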

From Robert Heller@3:633/10 to All on Saturday, June 20, 2026 22:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

At Sat, 20 Jun 2026 21:01:32 +0100 debian-user@howorth.org.uk wrote:

David Christensen <dpchrist@holgerdanske.com> wrote:

AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.

XFS has checksummed metadata for many years now (12?), but it doesn't checksum
user data.

I've always thought that the hardware controller checksums raw disk blocks (sectors) as part of the low-level I/O processing in the controller hardware's "firmware" and that this is how the controller knows it has a bad block.

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
heller@deepsoft.com -- Webhosting Services

From Paul Leiber@3:633/10 to All on Sunday, June 21, 2026 01:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 20.06.26 at 20:00, Charles Curley wrote:

On Sat, 20 Jun 2026 18:01:15 +0200
Paul Leiber <paul@onlineschubla.de> wrote:

Somehow, I failed to add the RAID1 information for md0 to the
configuration file (e. g. by entering root@localhost:~# mdadm
--detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure if
this actually is the cause and adding that information would solve
the issue.
My questions are the following:

1. Is my analysis valid in principle? Especially: Could the root
cause for this issue be that mdadm.conf is missing the information
for md0, and could adding that information prevent data loss or
inconsistencies in the future?

I doubt that this is the culprit. The man page for mdadm says, in part:

Assemble
Assemble the components of a previously created array
into an active array. Components can be explicitly given or can
be searched for. mdadm checks that the components do form a
bona fide array, and can, on request, fiddle superblock
information so as to assemble a faulty array.

So mdadm *should* find both devices. But it might not. And adding
that line will not hurt. I have a similar line in my mdadm.conf.

I built my RAID array up a bit differently than you did yours. You made
your partitions, put LUKS on the partitions, then the RAID on top of
that. I have the partitions, then the RAID array, LUKS on top of that,
then LVM, with file systems on top of the LVs. But I know of no reason
your setup shouldn't work.

I have found that when I have multiple LUKS partitions, giving them
all the same passphrase means I need give only one passphrase to
decrypt on boot.

2. Can I (re)create the RAID1 md0 or (re-)add the missing partition
in an easy way that no or at least not all information is lost? If
yes, how?

Yes. For the gory details see https://oneuptime.com/blog/post/2026-03-02-how-to-replace-a-failed-disk-in-mdadm-raid-on-ubuntu/view.

In short,

* Fail the offending disk. It looks like this has already happened, but
it shouldn't hurt to do it again.

* Remove the disk from the array.

* Add the disk back in again. This should trigger rebuilding, which
takes a while. During the rebuild, the data should be both readable
and writable. You may monitor with:

cat /proc/mdstat

I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.

Let me explain with an example: I am using KODI to access my video data from different devices. A couple of months ago, I switched KODI to using a centralized database (containing metadata information on movies, watch status, etc.) in order to maintain only one database instead of a database on each device running KODI. The data is stored on the database VM, running MariaDB, which stores the data on the md software RAID1 (at least that's what was supposed to happen). I spent some time configuring the metadata, e.g. correcting mistakes in the movie titles etc. I then noticed that I had mistakenly selected English as the language for the movie descriptions. Because family members are not fluent in English, I redid the metadata configuration in German. (It was annoying work, which is why I remember it well.) Then, some time later, after a reboot of the hypervisor (and the database VM) due to kernel updates, the movie descriptions were displayed in English again. I attributed this to a corrupt database after the database VM reboot and loaded a database backup from some time ago, where the movie description was still in English. So I did the metadata configuration a third time, again in German. (I guess you can imagine the fun I had.) Then, a couple of days ago, after a reboot of the hypervisor and the database VM, the KODI movie description was displayed in English again. That's when I really started digging, because now it was clear that there were actually two intact, but differing databases. (To be clear: There were some other changes to other databases that also were affected in a similar manner which I don't mention in this example, so this issue is not restricted to the KODI database).

Based on the available data, I attribute this issue to the RAID1 which seems to select one of two partitions at random when booting the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed device given by mdadm --examine matches the point in time of the switch from one database version ("German") to the other ("English"); therefore I assume that the switch happens at the software RAID level.
- A failure at hardware level doesn't seem likely, because how could there suddenly be an older version of a database available in a RAID1 if one device fails and the RAID1 is degraded, and this after entirely rebuilding the database from a backup? And, mind you, this switch to an older version of the database didn't happen just once, but at least two times. The data (in English) simply shouldn't have been available anymore at this point if the RAID1 had been working as intended.

The most likely explanation to me is that the RAID1 has been running in a degraded state for some time (unnoticed by me), the database changes (e. g. from English to German) were stored to just one of the two partitions, and at some point the RAID1 switched to the other partition after a reboot, containing intact, but older (e. g. English) data. As defective hardware doesn't seem likely, I assume that something in my setup causes this behaviour by md. But of course, I might be wrong and I am open to other explanations. For example, what my assumption fails to explain is why the switch only happens from time to time, and not more often, e.g. after each reboot.

The example you kindly give is for removing a seemingly failed partition (currently dm-30, "German" database) from an md RAID1, keeping the data on an intact partition (currently dm-31, "English" database) and then re-adding a partition to the RAID1. This is pretty straightforward: the data is kept and replicated from the valid partition to the freshly added one. However, in my case, the dataset I want to keep is on the seemingly failed partition not used in the RAID (currently dm-30, "German").

Options I see (besides recreating the RAID1 from scratch and using an available backup to restore the data, losing some data):

1. I could fail the seemingly intact partition or remove the RAID1 entirely, somehow use the seemingly failed partition (dm-30, "German") to create a new RAID without losing the data on it, then add the other partition (dm-31) as a new drive and have the data replicated. I am not sure if this is possible, therefore my question to this list.

2. Another option is to reboot the hypervisor and hope for a switch of the RAID to the partition containing the more recent version of the database, then follow your guide. But I am not really confident that such a "strategy" is the best choice I have at the moment. Also, I just tried a reboot three times, each time the data in the database is the wrong, old one.

3. I could also back up the database from the seemingly failed partition in order to not lose data and then use this backup to recreate the RAID1, but I would need to mount that partition, which ended in an error when I tried it.
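Option 1 could be sketched as follows. This is an untested outline under several assumptions: the database VM is shut down first, an image or backup of both members exists, and the helper names (run, start_from, resync_onto), the DRY_RUN switch and the device paths are placeholders invented for this example. By default the commands are only printed. The array is first started read-only from the single member to keep, so its contents can be checked before anything is overwritten.

```shell
# Print the command (DRY_RUN=1, default) or execute it (DRY_RUN=0).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Step 1: stop the array and start it read-only from the member to keep.
# $1 = array, $2 = member whose data should survive.
start_from() {
    run mdadm --stop "$1" &&
    run mdadm --assemble --run --readonly "$1" "$2"
}

# Step 2, only after the data on the array has been checked:
# make the array writable and rebuild onto the other member.
# $1 = array, $2 = member to overwrite.
resync_onto() {
    run mdadm --readwrite "$1" &&
    run mdadm --zero-superblock "$2" &&   # forget the old, diverged metadata
    run mdadm "$1" --add "$2"             # full rebuild from the kept member
}

start_from  /dev/md0 /dev/mapper/MEMBER_TO_KEEP
resync_onto /dev/md0 /dev/mapper/MEMBER_TO_OVERWRITE
```

Splitting the procedure into two functions leaves a natural pause between them for verifying that the array really shows the wanted database state.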

And, of course, I don't want this to happen again, therefore I want to find the root cause for this situation and fix it. If it is not the missing information in /etc/mdadm/mdadm.conf, what else could it be?

Sorry for the lengthy posts, I don't know how to describe this situation clearly in a shorter way.

From nwe@3:633/10 to All on Sunday, June 21, 2026 04:20:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/20/26 6:23 PM, Paul Leiber wrote:

I just noticed that I didn't manage to make clear that (1) I don't
think that there is one specific failed partition, but that both
partitions containing databases seem to work, but not at the same
time, and that (2) I want to keep the data on the seemingly failed device.

In case this helps: I think what you are trying to describe is "split
brain error" https://en.wikipedia.org/wiki/Split-brain_(computing)

In short you have two versions, each with separate up-to-date data,
which you will want to merge. Maybe there is someone here who knows a
good way to do this. I currently have no experience working with such an error, but have read about it some.

From David Christensen@3:633/10 to All on Sunday, June 21, 2026 06:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/20/26 13:01, debian-user@howorth.org.uk wrote:

David Christensen <dpchrist@holgerdanske.com> wrote:

AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.

XFS has checksummed metadata for many years now (12?), but it doesn't
checksum user data.

Thank you for the clarification:

https://wiki.archlinux.org/title/XFS#Checksumming

On 6/20/26 13:26, Robert Heller wrote:

I've always thought that the hardware controller checksums raw disk
blocks (sectors) as part of the low-level I/O processing in the
controller hardware's "firmware" and that this is how the controller
knows it has a bad block.

That is also my understanding. The HDD's I own are proprietary, so the engineering documentation is unavailable. But, I seem to recall reading
an article on the WWW stating that HDD's store additional data in hidden blocks on the media to allow detecting and/or correcting several flipped
bits.

On 6/20/26 16:23, Paul Leiber wrote:

I just noticed that I didn't manage to make clear that (1) I don't
think that there is one specific failed partition, but that both
partitions containing databases seem to work, but not at the same
time, and that (2) I want to keep the data on the seemingly failed
device.

Let me explain with an example: I am using KODI to access my video
data from different devices. A couple of months ago, I switched KODI
to using a centralized database (containing metadata information on
movies, watch status, etc.) in order to maintain only one database
instead of a database on each device running KODI. The data is
stored on the database VM, running MariaDB, which stores the data on
the md software RAID1 (at least that's what was supposed to happen).
I spent some time configuring the metadata, e.g. correcting mistakes
in the movie titles etc. I then noticed that I mistakenly selected
English language to display the movie descriptions. Because of
family members not fluent in English, I redid the metadata
configuration in German. (It was annoying work, therefore I
remember it well.) Then, some time later, after a reboot of the
hypervisor (and the database VM) due to kernel updates, the
language of the movie descriptions was displayed in English again. I attributed this to a corrupt database after the database VM reboot
and loaded a database backup from some time ago, where the movie
description was still in English. So I did the metadata
configuration a third time, again in German. (I guess you can
imagine the fun I had.) Then, a couple of days ago, after a reboot
of the hypervisor and the database VM, the KODI movie description
was displayed in English again. That's when I really started
digging, because now it was clear that there were actually two
intact, but differing databases. (To be clear: There were some other
changes to other databases that also were affected in a similar
manner which I don't mention in this example, so this issue is not restricted to the KODI database).

Based on the available data, I attribute this issue to the RAID1
which seems to select one of two partitions at random when booting
the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed
device given by mdadm --examine matches the point in time of the
switch from one database version ("German") to the other ("English"),
therefore I assume that the switch happens at the software RAID level.
- A failure at
hardware level doesn't seem likely, because how could there suddenly
be an older version of a database available in a RAID1 if one device
fails and the RAID1 is degraded, and this after entirely rebuilding
the database from a backup? And, mind you, this switch to an older
version of the database didn't happen just once, but at least two
times. The data (in English) simply shouldn't have been available
anymore at this point if the RAID1 had been working as intended.

The most likely explanation to me is that the RAID1 has been running
in a degraded state for some time (unnoticed by me), the database
changes (e. g. from English to German) were stored to just one of
the two partitions, and at some point the RAID1 switched to the
other partition after a reboot, containing intact, but older (e. g.
English) data. As a defective hardware doesn't seem likely, I assume
that something in my setup causes this behaviour by md. But of
course, I might be wrong and I am open to other explanations. For
example, what my assumption fails to explain is why the switch only
happens from time to time, and not more often, e.g. after each
reboot.
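
One way to check the hypothesis above is to compare the Events counter (and the Update Time) that mdadm --examine prints for each member. Counters that differ would be consistent with the two partitions having been written independently of each other. A minimal sketch, assuming the usual "Events : N" line in the --examine output; examine_events and newer_member are hypothetical helper names, not mdadm commands:

```shell
#!/bin/sh
# Hypothetical helpers for illustration only, not part of mdadm.

# Print the value of the "Events" field from `mdadm --examine` output on stdin.
examine_events() {
    awk -F' *: *' '/^ *Events *:/ {
        sub(/[ \t]+$/, "", $2)   # strip trailing whitespace from the value
        print $2
        exit
    }'
}

# Print the name of the member with the higher event count, or "equal".
# usage: newer_member NAME_A EVENTS_A NAME_B EVENTS_B
newer_member() {
    if [ "$2" -gt "$4" ]; then
        echo "$1"
    elif [ "$2" -lt "$4" ]; then
        echo "$3"
    else
        echo "equal"
    fi
}
```

On the real system the counters could be read with "mdadm --examine /dev/dm-30 | examine_events" and the same for /dev/dm-31, then passed to newer_member together with the device names.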

The example you kindly give is for removing a seemingly failed
partition (currently dm-30, "German" database) from a md RAID1,
keeping the data on an intact partition (currently dm-31, "English"
database) and then re-adding a partition to the RAID1. This is
pretty straightforward: the data is kept and replicated from the
valid partition to the freshly added one. However, in my case, the
dataset I want to keep is on the seemingly failed partition not used
in the RAID (currently dm-30, "German").

Options I see (besides recreating the RAID1 from scratch and using
an available backup to restore the data, losing some data):

1. I could fail the seemingly intact partition or remove the RAID1
entirely, somehow use the seemingly failed partition (dm-30,
"German") to create a new RAID without losing the data on it, then
add the other partition (dm-31) as a new drive and have the data
replicated. I am not sure if this is possible, therefore my question
to this list.

2. Another option is to reboot the hypervisor and hope for a switch
of the RAID to the partition containing the more recent version of
the database, then follow your guide. But I am not really confident
that such a "strategy" is the best choice I have at the moment.
Also, I just tried a reboot three times, each time the data in the
database is the wrong, old one.

3. I could also backup the database from the seemingly failed
partition in order to not lose data and then use this backup to
recreate the RAID1, but I would need to mount that partition, which
ended in an error when I tried it.

And, of course, I don't want this to happen again, therefore I want
to find the root cause for this situation and fix it. If it is not
the missing information in /etc/mdadm/mdadm.conf, what else could it
be?

Sorry for the lengthy posts, I don't know how to describe this
situation clearly in a shorter way.

Do you own a power supply tester? If not, I suggest buying one. When
you have one, test your power supply. It is possible for one rail to
fail (e.g. +12 VDC), the computer to boot, and all or part of the
computer to operate incorrectly. Without a power supply tester, you
will be chasing seemingly random errors until the power supply goes to
100% failure and/or you damage/ destroy other hardware.

Does your computer have ECC memory? If not, I suggest getting a
computer with ECC memory. In any case, I suggest testing your memory
with memtest86+ for 24 hours.

Have you tested your hard disks? If not, I suggest running
smartctl(8) "--test long". When testing is done, view the results
with "--xall".

Have you validated the filesystem with fsck.xfs(8)? If not, I suggest
doing so.

Do you have streams of database transactions since the last known
good backups? If so, can they be replayed?

Can you switch the databases to read-only, shutdown, disconnect the
first disk, boot, backup the database(s), shutdown, connect the first
disk, disconnect the second disk, boot, backup the database(s),
shutdown, and connect the second disk? If so, you could then restore
those backups, and the last known good backups, (with different names,
read-only) and trouble-shoot. It may or may not be possible to identify
the newest data and implement queries/ scripts to do the three-way merge.

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Sunday, June 21, 2026 07:00:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/20/26 21:47, David Christensen wrote:

Can you switch the databases to read-only, shutdown, disconnect the
first disk, boot, backup the database(s), shutdown, connect the first
disk, disconnect the second disk, boot, backup the database(s),
shutdown, and connect the second disk? If so, you could then restore
those backups, and the last known good backups, (with different names, read-only

Correction -- add ", on a known good database server".

) and trouble-shoot. It may or may not be possible to identify
the newest data and implement queries/ scripts to do the three-way merge.

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Sunday, June 21, 2026 07:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Sat, Jun 20, 2026 at 09:47:23PM -0700, David Christensen wrote:
[...]

That is also my understanding. The HDD's I own are proprietary, so the engineering documentation is unavailable. But, I seem to recall reading an article on the WWW stating that HDD's store additional data in hidden blocks on the media to allow detecting and/or correcting several flipped bits.

Much more than that. They do accept some error rate which is corrected
(it has long been an engineering tradeoff: more density -> higher error
rate -> better error correction codes):
"For example, a typical 1 TB hard disk with 512-byte sectors provides
additional capacity of about 93 GB for the ECC data."
From: https://en.wikipedia.org/wiki/Hard_disk#Error_rates_and_handling

Wikipedia: better than "the WWW". Especially in these times.
Cheers
--
t

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Charles Curley@3:633/10 to All on Sunday, June 21, 2026 15:10:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Sat, 20 Jun 2026 16:26:59 -0400 (EDT)
Robert Heller <heller@deepsoft.com> wrote:

I've always thought that the hardware controller checksums raw disk
blocks (sectors) as part of the low-level I/O processing in the
controller hardware's "firmware" and that this is how the controller
knows it has a bad block.

Correct.

Each sector has the data, and enough checksum data that the controller
can test the data for integrity, and correct some small errors (a few
bytes, typically).

When the controller detects an error it attempts to correct it. If it
succeeds, it returns the data to the computer and re-writes the data to
the platter. If that write fails, it marks the sector as bad, and
selects a spare sector to replace it.

If the attempt to correct the error fails, I believe the drive reports
the error to the computer, and allocates a spare sector to replace the
failed one. It is then up to the OS or even application software to
generate a replacement sector or otherwise handle the problem.

These days the controller runs surface tests in the background to
detect and correct errors before they get too big to correct. In
addition, you can run an extended self-test, which includes a surface
test, with SMART software.

There is a fixed supply of spare sectors to be allocated; when that is exhausted the OS starts marking bad blocks. These days that means it is
time to replace the drive. This is why it is important to monitor
drives for these (and other) failures with SMART software.

--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Sunday, June 21, 2026 16:20:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-21, Paul Leiber wrote:

I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.

[...]

I used raid for tens of years and never had the problem you described. So
I believe it should come from something else than raid failure. raid
duplicates data on both partitions. If you don't have a failed partition
the duplication is done transparently and quickly (<second). When a
partition fails you can add a new and clean partition and the original
one, the partition still in raid, is synchronized into it. As others told
you, you should use smartd to monitor your disks.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Sunday, June 21, 2026 21:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 21.06.26 um 06:47 schrieb David Christensen:

On 6/20/26 13:01, debian-user@howorth.org.uk wrote:

David Christensen <dpchrist@holgerdanske.com> wrote:

AIUI neither XFS nor mdadm compute, store, or verify checksums of data or metadata on disk.

XFS checksums metadata for many years now (12?), but it doesn't checksum user data.

Thank you for the clarification:

https://wiki.archlinux.org/title/XFS#Checksumming

On 6/20/26 13:26, Robert Heller wrote:

I've always thought that the hardware controller checksums raw disk blocks (sectors) as part of the low-level I/O processing in the controller hardware's "firmware" and that this is how the controller
knows it has a bad block.

That is also my understanding. The HDD's I own are proprietary, so the engineering documentation is unavailable. But, I seem to recall reading an article on the WWW stating that HDD's store additional data in hidden blocks on the media to allow detecting and/or correcting several flipped bits.

On 6/20/26 16:23, Paul Leiber wrote:

I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.

Let me explain with an example: I am using KODI to access my video data from different devices. A couple of months ago, I switched KODI to using a centralized database (containing metadata information on movies, watch status, etc.) in order to maintain only one database instead of a database on each device running KODI. The data is stored on the database VM, running MariaDB, which stores the data on the md software RAID1 (at least that's what was supposed to happen). I spent some time configuring the metadata, e.g. correcting mistakes in the movie titles etc. I then noticed that I mistakenly selected English language to display the movie descriptions. Because of family members not fluent in English, I redid the metadata configuration in German. (It was an annoying work, therefore I remember it well.) Then, some time later, after a reboot of the hypervisor (and the database VM) due to kernel updates, the
language of the movie descriptions was displayed in English again. I attributed this to a corrupt database after the database VM reboot and loaded a database backup from some time ago, where the movie description was still in English. So I did the metadata configuration a third time, again in German. (I guess you can imagine the fun I had.) Then, a couple of days ago, after a reboot of the hypervisor and the database VM, the KODI movie description was displayed in English again. That's when I really started digging, because now it was clear that there were actually two intact, but differing databases. (To be clear: There were some other changes to other databases that also were affected in a similar manner which I don't mention in this example, so this issue is not restricted to the KODI database).

Based on the available data, I attribute this issue to the RAID1 which seems to select one of two partitions at random when booting the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed device given by mdadm --examine matches the point in time of the switch from one database version ("German") to the other ("English"), therefore I assume that the switch happens at the software RAID level.
- A failure at hardware level doesn't seem likely, because how could there suddenly be an older version of a database available in a RAID1 if one device fails and the RAID1 is degraded, and this after entirely rebuilding the database from a backup? And, mind you, this switch to an older version of the database didn't happen just once, but at least two times. The data (in English) simply shouldn't have been available anymore at this point if the RAID1 had been working as intended.

The most likely explanation to me is that the RAID1 has been running in a degraded state for some time (unnoticed by me), the database changes (e. g. from English to German) were stored to just one of the two partitions, and at some point the RAID1 switched to the other partition after a reboot, containing intact, but older (e. g. English) data. As a defective hardware doesn't seem likely, I assume that something in my setup causes this behaviour by md. But of course, I might be wrong and I am open to other explanations. For example, what my assumption fails to explain is why the switch only happens from time to time, and not more often, e.g. after each reboot.

The example you kindly give is for removing a seemingly failed partition (currently dm-30, "German" database) from a md RAID1, keeping the data on an intact partition (currently dm-31, "English" database) and then re-adding a partition to the RAID1. This is pretty straightforward: the data is kept and replicated from the valid partition to the freshly added one. However, in my case, the dataset I want to keep is on the seemingly failed partition not used in the RAID (currently dm-30, "German").

Options I see (besides recreating the RAID1 from scratch and using an available backup to restore the data, losing some data):

1. I could fail the seemingly intact partition or remove the RAID1 entirely, somehow use the seemingly failed partition (dm-30, "German") to create a new RAID without losing the data on it, then add the other partition (dm-31) as a new drive and have the data replicated. I am not sure if this is possible, therefore my question to this list.

2. Another option is to reboot the hypervisor and hope for a switch of the RAID to the partition containing the more recent version of the database, then follow your guide. But I am not really confident that such a "strategy" is the best choice I have at the moment. Also, I just tried a reboot three times, each time the data in the database is the wrong, old one.

3. I could also backup the database from the seemingly failed partition in order to not lose data and then use this backup to recreate the RAID1, but I would need to mount that partition, which ended in an error when I tried it.

And, of course, I don't want this to happen again, therefore I want to find the root cause for this situation and fix it. If it is not the missing information in /etc/mdadm/mdadm.conf, what else could it be?

Sorry for the lengthy posts, I don't know how to describe this situation clearly in a shorter way.

Do you own a power supply tester? If not, I suggest buying one. When you have one, test your power supply. It is possible for one rail to fail (e.g. +12 VDC), the computer to boot, and all or part of the computer to operate incorrectly. Without a power supply tester, you will be chasing seemingly random errors until the power supply goes to 100% failure and/or you damage/ destroy other hardware.

There are two identical hard drives (manufacturer, type, size) in my system used for data storage. Both have an identical layout: One large partition used as BTRFS RAID1 (duplicated data and metadata), one small partition which is used for the md software RAID1. BTRFS is working fine, it's just the md software RAID that is not working correctly. I can't rule out an issue with power supply, but it seems unlikely, considering that BTRFS is not having any issue. But I will put a test of the power supply on the list of things to try.

Does your computer have ECC memory? If not, I suggest getting a computer with ECC memory. In any case, I suggest testing your memory with memtest86+ for 24 hours.

Yes, it does have ECC memory. I will put the memory test on the list as well.

Have you tested your hard disks? If not, I suggest running smartctl(8) "--test long". When testing is done, view the results with "--xall".

Both disks are monitored via smartctl. Automated short and long tests are being done regularly. There are no indications for hardware failure in the smart data.

Have you validated the filesystem with fsck.xfs(8)? If not, I suggest doing so.

I just did a check of the file system (using xfs_repair -n), with no errors reported.

Do you have streams of database transactions since the last known good backups? If so, can they be replayed?

Not exactly knowing what such a stream is, I guess I don't have one. But I am not sure. Will check.

Can you switch the databases to read-only, shutdown, disconnect the first disk, boot, backup the database(s), shutdown, connect the first disk, disconnect the second disk, boot, backup the database(s), shutdown, and connect the second disk? If so, you could then restore those backups, and the last known good backups, (with different names, read-only) and trouble-shoot. It may or may not be possible to identify the newest data and implement queries/ scripts to do the three-way merge.

That's a good suggestion. I will need to check what booting with just one disk could do to the BTRFS filesystem, but this might be a way to force md to use the disk which is currently indicated as failed.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Sunday, June 21, 2026 22:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 21.06.26 um 16:12 schrieb Michel Verdier:

On 2026-06-21, Paul Leiber wrote:

I just noticed that I didn't manage to make clear that (1) I don't think that
there is one specific failed partition, but that both partitions containing
databases seem to work, but not at the same time, and that (2) I want to keep
the data on the seemingly failed device.

[...]

I used raid for tens of years and never had the problem you described. So
I believe it should come from something else than raid failure. raid
duplicates data on both partitions. If you don't have a failed partition
the duplication is done transparently and quickly (<second). When a
partition fails you can add a new and clean partition and the original
one, the partition still in raid, is synchronized into it. As others told
you, you should use smartd to monitor your disks.

My knowledge in IT is limited. I just can describe what I can observe and make guesses. (The md RAID is part of a setup I do for fun at home.) I know that it sounds strange, but my best guess is that there are two differing databases stored on my hard drives. How else can the repeated switch between different data sets be explained?

To comment on your suggestion to monitor hardware status of the disks: Both disks are monitored using smartd (short and long tests being conducted on a regular basis), smartctl doesn't indicate any issue with the drives.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Sunday, June 21, 2026 23:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 21.06.26 um 01:23 schrieb Paul Leiber:

Am 20.06.26 um 20:00 schrieb Charles Curley:

On Sat, 20 Jun 2026 18:01:15 +0200
Paul Leiber <paul@onlineschubla.de> wrote:

Somehow, I missed to include the RAID1 information for md0 to the
configuration file (e. g. by entering root@localhost:~# mdadm
--detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure if
this actually is the cause and adding that information would solve
the issue.
My questions are the following:

1. Is my analysis valid in principle? Especially: Could the root
cause for this issue be that mdadm.conf is missing the information
for md0, and could adding that information prevent data loss or
inconsistencies in the future?

I doubt that this is the culprit. The man page for mdadm says, in part:

   Assemble
   Assemble the components of a previously created array
   into an active array. Components can be explicitly given or can
   be searched for. mdadm checks that the components do form a
   bona fide array, and can, on request, fiddle superblock
   information so as to assemble a faulty array.

So mdadm *should* find both devices. But it might not be. And adding
that line will not hurt. I have a similar line in my mdadm.conf.
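
A small sketch of adding that line without creating duplicates when the command is run more than once. append_once is a hypothetical helper, not part of mdadm. On Debian, the header of mdadm.conf asks for "update-initramfs -u" after the file is changed, so that the new entry is also seen at boot:

```shell
#!/bin/sh
# Hypothetical helper for illustration only, not part of mdadm.

# Append LINE to file CONF unless an identical line is already present.
# usage: append_once CONF LINE
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

Assumed usage on the host, as root: append_once /etc/mdadm/mdadm.conf "$(mdadm --detail --scan /dev/md0)"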

I built my RAID array up a bit differently than you did yours. You made
your partitions, put LUKS on the partitions, then the RAID on top of
that. I have the partitions, then the RAID array, LUKS on top of that,
then LVM, with file systems on top of the LVs. But I know of no reason
your setup shouldn't work.

I have found that when I have multiple LUKS partitions, giving them
all the same passphrase means I need give only one passphrase to
decrypt on boot.

2. Can I (re)create the RAID1 md0 or (re-)add the missing partition
in an easy way that no or at least not all information is lost? If
yes, how?

Yes. For the gory details see
https://oneuptime.com/blog/post/2026-03-02-how-to-replace-a-failed-disk-in-mdadm-raid-on-ubuntu/view.

In short,

* Fail the offending disk. It looks like this has already happened, but
  it shouldn't hurt to do it again.

* Remove the disk from the array.

* Add the disk back in again. This should trigger rebuilding, which
  takes a while. During the rebuild, the data should be both readable
  and writable. You may monitor with:

  cat /proc/mdstat

I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.

Let me explain with an example: I am using KODI to access my video data from different devices. A couple of months ago, I switched KODI to using a centralized database (containing metadata information on movies, watch status, etc.) in order to maintain only one database instead of a database on each device running KODI. The data is stored on the database VM, running MariaDB, which stores the data on the md software RAID1 (at least that's what was supposed to happen). I spent some time configuring the metadata, e.g. correcting mistakes in the movie titles etc. I then noticed that I mistakenly selected English language to display the movie descriptions. Because of family members not fluent in English, I redid the metadata configuration in German. (It was an annoying work, therefore I remember it well.) Then, some time later, after a reboot of the hypervisor (and the database VM) due to kernel updates, the language of the movie descriptions was displayed in English again. I
attributed this to a corrupt database after the database VM reboot and loaded a database backup from some time ago, where the movie description was still in English. So I did the metadata configuration a third time, again in German. (I guess you can imagine the fun I had.) Then, a couple of days ago, after a reboot of the hypervisor and the database VM, the KODI movie description was displayed in English again. That's when I really started digging, because now it was clear that there were actually two intact, but differing databases. (To be clear: There were some other changes to other databases that also were affected in a similar manner which I don't mention in this example, so this issue is not restricted to the KODI database).

Based on the available data, I attribute this issue to the RAID1 which seems to select one of two partitions at random when booting the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed device given by mdadm --examine matches the point in time of the switch from one database version ("German") to the other ("English"), therefore I assume that the switch happens at the software RAID level.
- A failure at hardware level doesn't seem likely, because how could there suddenly be an older version of a database available in a RAID1 if one device fails and the RAID1 is degraded, and this after entirely rebuilding the database from a backup? And, mind you, this switch to an older version of the database didn't happen just once, but at least two times. The data (in English) simply shouldn't have been available anymore at this point if the RAID1 had been working as intended.

The most likely explanation to me is that the RAID1 has been running in a degraded state for some time (unnoticed by me), the database changes (e. g. from English to German) were stored to just one of the two partitions, and at some point the RAID1 switched to the other partition after a reboot, containing intact, but older (e. g. English) data. As a defective hardware doesn't seem likely, I assume that something in my setup causes this behaviour by md. But of course, I might be wrong and I am open to other explanations. For example, what my assumption fails to explain is why the switch only happens from time to time, and not more often, e.g. after each reboot.

The example you kindly give is for removing a seemingly failed partition (currently dm-30, "German" database) from a md RAID1, keeping the data on an intact partition (currently dm-31, "English" database) and then re-adding a partition to the RAID1. This is pretty straightforward: the data is kept and replicated from the valid partition to the freshly added one. However, in my case, the dataset I want to keep is on the seemingly failed partition not used in the RAID (currently dm-30, "German").

Options I see (besides recreating the RAID1 from scratch and using an available backup to restore the data, losing some data):

1. I could fail the seemingly intact partition or remove the RAID1 entirely, somehow use the seemingly failed partition (dm-30, "German") to create a new RAID without losing the data on it, then add the other partition (dm-31) as a new drive and have the data replicated. I am not sure if this is possible, therefore my question to this list.

2. Another option is to reboot the hypervisor and hope for a switch of the RAID to the partition containing the more recent version of the database, then follow your guide. But I am not really confident that such a "strategy" is the best choice I have at the moment. Also, I just tried a reboot three times, each time the data in the database is the wrong, old one.

3. I could also backup the database from the seemingly failed partition in order to not lose data and then use this backup to recreate the RAID1, but I would need to mount that partition, which ended in an error when I tried it.

And, of course, I don't want this to happen again, therefore I want to find the root cause for this situation and fix it. If it is not the missing information in /etc/mdadm/mdadm.conf, what else could it be?

Sorry for the lengthy posts, I don't know how to describe this situation clearly in a shorter way.

I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in usage) in order not to lose data. Then, I recreated the RAID1 using the data from partition_1:

mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically

md is currently replicating the data (German movie descriptions in Kodi, yay!) from partition_1 to partition_2. I might have to turn partition_2 from "spare" to "active", but I'll let the replication complete first.
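
A sketch for following the replication, assuming the progress line that /proc/mdstat normally shows during a rebuild ("recovery = N%" or "resync = N%"); rebuild_progress is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper for illustration only, not part of mdadm.

# Print the recovery/resync percentage found in /proc/mdstat-formatted text
# on stdin. Prints nothing if no rebuild is in progress.
rebuild_progress() {
    sed -nE 's/.*(recovery|resync) *= *([0-9.]+%).*/\2/p'
}
```

Assumed usage: rebuild_progress < /proc/mdstat. To simply block until the rebuild has finished, "mdadm --wait /dev/md0" should also work.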

In any case, I set up mdmonitor to alert me if the RAID1 degrades again. That's something I should have thought of earlier.
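
As an additional check next to mdmonitor, something like the following could be run from cron. It is only a sketch, assuming the /proc/mdstat layout shown at the start of this thread (a status field such as [_U] on the "blocks" line); degraded_arrays is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper for illustration only, not part of mdadm.

# Read /proc/mdstat-formatted text on stdin and print the name of every
# array whose member-status field (e.g. [_U]) contains an underscore,
# i.e. every array with at least one missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }               # remember the current array
        name != "" && match($0, /\[[U_]+\]/) {    # status field like [UU] or [_U]
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}
```

Assumed usage: degraded_arrays < /proc/mdstat. Any output means an array is running degraded.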

We'll see if this issue occurs again. I'll give an update if this is the case.

Thanks to everybody for trying to help me!

Paul

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Monday, June 22, 2026 07:40:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-21, Paul Leiber wrote:

My knowledge in IT is limited. I just can describe what I can observe and make
guesses. (The md RAID is part of a setup I do for fun at home.) I know that it
sounds strange, but my best guess is that there are two differing databases stored on my hard drives. How else can the repeated switch between different data sets be explained?

I think you should investigate your database installation. The problem
you described cannot come from raid1. It could be database installed on "partition" and not on "md array". Or a backup reverted. Or something
else.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Monday, June 22, 2026 08:00:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-21, Paul Leiber wrote:

I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in usage) in order not to lose data. Then, I recreated the RAID1 using the data from partition_1:

mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

Beside this it is much quicker and safer to go this way :
- do not stop the md array (and thus the assemble is not needed)
- remove the failed partition
mdadm --manage /dev/md0 --remove "failed partition"
- add the new clean partition
mdadm --manage /dev/md0 --add "good partition"
- and let mdadm sync the array

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Monday, June 22, 2026 08:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Mon, Jun 22, 2026 at 07:50:10AM +0200, Michel Verdier wrote:

On 2026-06-21, Paul Leiber wrote:

I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in usage) in order not to lose data. Then, I recreated the RAID1 using the data
from partition_1:

mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

If I have been following along, the RAID parts are LUKS encrypted devices,
so to me it does make sense.
Cheers
--
t

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Monday, June 22, 2026 08:20:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-22, tomas@tuxteam.de wrote:

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

If I have been following along, the RAID parts are LUKS encrypted devices,
so to me it does make sense.

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Monday, June 22, 2026 09:40:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Mon, Jun 22, 2026 at 08:18:45AM +0200, Michel Verdier wrote:

On 2026-06-22, tomas@tuxteam.de wrote:

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

If I have been following along, the RAID parts are LUKS encrypted devices, so to me it does make sense.

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?
As far as I understand OP, their case is the other way around (and I don't
see why it shouldn't be technically possible: a block device is a block
device is a block device, after all).
Cheers
--
t

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Monday, June 22, 2026 09:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 22.06.26 um 07:50 schrieb Michel Verdier:

On 2026-06-21, Paul Leiber wrote:

I managed to rebuild the md RAID1 using the data on the seemingly failed
device (partition_1). First, I did a dd dump of partition_2 (currently in
usage) in order not to lose data. Then, I recreated the RAID1 using the data
from partition_1:

mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

The partitions are LUKS encrypted and hence decrypted before being assembled into the RAID array. Thus the mapped drives.

Beside this it is much quicker and safer to go this way :
- do not stop the md array (and thus the assemble is not needed)
- remove the failed partition
mdadm --manage /dev/md0 --remove "failed partition"
- add the new clean partition
mdadm --manage /dev/md0 --add "good partition"
- and let mdadm sync the array

There was a partition which md claimed was failed (partition_1) which contained the newer database. There was a partition the degraded array was using (partition_2) which contained the older database. I wanted to keep the newer database on the seemingly failed device.

I was not sure if

1) it is possible to remove all devices from a md array (removing partition_2 would have resulted in an array without any partition)
2) adding a device with data on it I want to keep (partition_1) to an array is possible without losing the data (adding partition_2 to the freshly created md array resulted in the data on partition_2 being overwritten, as intended, but I wanted to avoid that partition_1 is overwritten with data from partition 2)

Hence I decided to reassemble the array starting with partition_1 and adding partition_2.

Are you positive that the procedure you recommend would have ended in the same result?

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Monday, June 22, 2026 10:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 22.06.26 um 09:35 schrieb tomas@tuxteam.de:

On Mon, Jun 22, 2026 at 08:18:45AM +0200, Michel Verdier wrote:

On 2026-06-22, tomas@tuxteam.de wrote:

Your commands are strange : the partitions should be the disk partitions
from /dev and not mapped ones. Or you have another layer ? From where
come /dev/mapper/partition_1 and /dev/mapper/partition_2 ?

If I have been following along, the RAID parts are LUKS encrypted devices,
so to me it does make sense.

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).

Tomas' description of my setup is correct, LUKS before RAID. It has been working in the past, and it is working right now again. Is this type of setup recommended? I don't know. BTRFS doesn't show any issues with this setup.

However, my main suspect for the root cause of the dual-head database is indeed that the LUKS decryption messes with the md RAID assembly at boot, e. g. some timing issue or race condition. The database content doesn't change constantly, there are very few writes per day, so I'll rely on daily backups and monitor the RAID closely. There was another kernel update today, so I'll see what happens after a reboot, which probably was what triggered the issue in the past. If another issue occurs, I'll probably have a chance to find more information in the logs now that I know what to look for. If my assumption is confirmed, I'll change the order to RAID before LUKS and restore the data from backup. (Or I'll do it anyway out of lack of other ideas...)

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Monday, June 22, 2026 10:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/21/26 12:46, Paul Leiber wrote:

Am 21.06.26 um 06:47 schrieb David Christensen:

Do you own a power supply tester?

... I will put a test of
the power supply on the list of things to try.

Good.

Does your computer have ECC memory?

Yes, it does have ECC memory.

Good.

I will put the memory test on the list as well.

Choose your testing tool and methodology carefully -- memtest86
(commercial) vs. memtest86+ (FOSS), memory correction report logging to motherboard firmware vs. operating system, etc. See this thread and
research carefully:

https://lists.debian.org/debian-user/2026/05/msg00386.html

Have you tested your hard disks?

Both disks are monitored via smartctl. Automated short and long tests
are being done regularly. There are no indications for hardware failure
in the smart data.

Good.

Have you validated the filesystem with fsck.xfs(8)?

I just did a check of the file system (using xfs_repair -n), with no
errors reported.

Good.

Do you have streams of database transactions since the last known good
backups? If so, can they be replayed?

Not exactly knowing what such a stream is, I guess I don't have one. But
I am not sure. Will check.

Good.

Can you switch the databases to read-only, shutdown, disconnect the
first disk, boot, backup the database(s), shutdown, connect the first
disk, disconnect the second disk, boot, backup the database(s),
shutdown, and connect the second disk?

That's a good suggestion. I will need to check what booting with just
one disk could do to the BTRFS filesystem, but this might be a way to
force md to use the disk which is currently indicated as failed.

Is your OS on the btrfs mirror? I have found that putting the OS on a dedicated SSD makes operations, maintenance, trouble-shooting, disaster preparedness/ recovery, etc., much easier.

On 6/21/26 14:45, Paul Leiber wrote:

I managed to rebuild the md RAID1 using the data on the seemingly
failed device (partition_1). First, I did a dd dump of partition_2 (currently in usage) in order not to lose data. Then, I recreated the
RAID1 using the data from partition_1:

mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically

md is currently replicating the data (German movie descriptions in
Kodi, yay!) from partition_1 to partition_2. I might have to turn partition_2 from "spare" to "active", but I'll let the replication
complete first.

In any case, I set up mdmonitor to alert me if the RAID1 degrades
again. That's something I should have thought of earlier.

We'll see if this issue occurs again. I'll give an update if this is
the case.

Thanks to everybody for trying to help me!

Paul

Thank you for the curious problem. We all learn when we work together
on a solution. Please let us know how it works out.

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Monday, June 22, 2026 10:50:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/21/26 23:18, Michel Verdier wrote:

RAID is always before LUKS : partition > RAID array > LUKS >
filesystem

On 6/22/26 00:35, tomas@tuxteam.de wrote:

"Is always" means for you "should always be" or "has to be"?

As far as I understand OP, their case is the other way around (and I
don't see why it shouldn't be technically possible: a block device
is a block device is a block device, after all).

On 6/22/26 01:04, Paul Leiber wrote:

Tomas' description of my setup is correct, LUKS before RAID. It has
been working in the past, and it is working right now again. Is this
type of setup recommended? I don't know. BTRFS doesn't show any
issues with this setup.

Stackable I/O layers are a feature of Linux and other operating systems. Depending upon which layers you want, there may be more than one way to
stack them.

An advantage of:

partitions > md RAID > LUKS > filesystem

Versus:

partitions > LUKS > md RAID > filesystem

Is that the former only has to do the encryption once for the RAID
virtual device, while the latter has to do encryption N times; once for
each partition.

When you have a layer that combines RAID, volume management, and
filesystems, such as ZFS and btrfs, the stackable encryption layer must
be underneath (e.g. the latter of above two I/O layering configurations).

For N=2, magnetic hard disk drives, and a 2+ core processor with
hardware cryptographic acceleration (e.g. Intel AES-NI), your current
I/O layering configuration should be okay.
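
To make the "encrypt once versus N times" point concrete, here is a rough dry-run sketch that prints the setup commands for either layering. Nothing is executed. The helper name print_stack_plan, the mapper names md0_crypt and member_N, and the partition paths are all assumptions for illustration; real setups would also need key handling and /etc/crypttab entries, which are left out.

```sh
# print_stack_plan ORDER DEV...   (hypothetical helper, dry run only)
# ORDER is "raid-first" or "luks-first"; DEV... are the member partitions.
print_stack_plan() {
  order=$1
  shift
  case $order in
    raid-first)
      # partitions > md RAID > LUKS > filesystem: one LUKS container
      echo "mdadm --create /dev/md0 --level=1 --raid-devices=$# $*"
      echo "cryptsetup luksFormat /dev/md0"
      echo "cryptsetup open /dev/md0 md0_crypt"
      echo "mkfs.xfs /dev/mapper/md0_crypt"
      ;;
    luks-first)
      # partitions > LUKS > md RAID > filesystem: one LUKS container per member
      members=""
      i=1
      for dev in "$@"; do
        echo "cryptsetup luksFormat $dev"
        echo "cryptsetup open $dev member_$i"
        members="$members /dev/mapper/member_$i"
        i=$((i + 1))
      done
      echo "mdadm --create /dev/md0 --level=1 --raid-devices=$#$members"
      echo "mkfs.xfs /dev/md0"
      ;;
    *)
      echo "unknown order: $order" >&2
      return 1
      ;;
  esac
}
```

Running print_stack_plan raid-first /dev/sda1 /dev/sdb1 shows a single luksFormat step, while the luks-first variant shows one per partition, and every one of those containers has to be unlocked before md can assemble the full mirror.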

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Andy Smith@3:633/10 to All on Monday, June 22, 2026 15:40:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Hi,

The lack of an mdadm.conf should not cause you any issues. It's only
really used to set non-default options, give a monitoring email address
and so on.

udev incrementally assembles MDADM arrays as devices appear. It does not
need any configuration to do this. In order to end up in the situation
OP is in, I can only imagine that they rebooted and only one of the
LUKS devices was set up, so md0 proceeded in a degraded fashion with
that device.

What is confusing to me is how OP had an active mdadm array member with
an event count significantly *behind* the inactive one. It makes me
think that this may have happened more than once, with different single
LUKS devices being activated each time.
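
For comparing the event counts mentioned above, a small sketch that pulls the "Events" field out of mdadm --examine output and reports which member is ahead. The helper names md_events and newer_member are made up for this example. A higher event count only says that md updated that member's superblock more often; if both members were written to at different times, neither copy is simply "the good one".

```sh
# md_events: read `mdadm --examine` output on stdin, print the event count.
md_events() {
  awk -F' : ' '/^ *Events :/ { print $2; exit }'
}

# newer_member NAME_A EVENTS_A NAME_B EVENTS_B   (hypothetical helper)
# Prints the name with the higher event count, or "equal".
newer_member() {
  if [ "$2" -gt "$4" ]; then
    echo "$1"
  elif [ "$4" -gt "$2" ]; then
    echo "$3"
  else
    echo "equal"
  fi
}
```

Typical use would be something like a=$(mdadm --examine /dev/mapper/partition_1 | md_events), the same for partition_2, and then newer_member partition_1 "$a" partition_2 "$b".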

The mdadm monitor daemon runs by default and should email you about
degraded arrays. Without any configuration that would be sending to root@localhost. OP should make sure that these emails will arrive
somewhere useful, or look into other ways of checking status of mdadm
arrays. What's happened here was likely trivial to fix at the
time of first problem but became a complete nightmare that likely
involved data loss (OP has backup of a device with unique data that
cannot be integrated).
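
As one possible "other way of checking status", here is a sketch that scans /proc/mdstat text for a member status string containing an underscore, such as [_U]. The function name degraded_arrays is an assumption for this example, and the parsing is only as robust as the mdstat layout shown earlier in the thread.

```sh
# degraded_arrays: read /proc/mdstat content on stdin and print the name of
# every array whose member status (e.g. [_U]) shows a missing device.
degraded_arrays() {
  awk '
    /^md[^ ]* :/ { name = $1 }                        # remember the current array
    /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
  '
}
```

It could be run from cron as degraded_arrays < /proc/mdstat, with any non-empty output treated as an alert.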

OP, after sorting out the monitoring I think you need to verify that
both LUKS devices are always successfully unlocked and available at boot
so that the RAID 1 assembles fully and properly.

I think it's unlikely that you have had a hardware failure of the
underlying drives, though you should of course check your logs and
smartctl for that. Given that LUKS is in use and is the most complicated
thing in your storage stack, I'd be looking into whether both LUKS
devices are being reliably created.
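
A simple way to check for this after boot is to test whether each expected mapper device exists as a block device. The sketch below assumes nothing beyond the shell's -b test; the function name missing_block_devices is hypothetical, and the paths in the usage note are the ones used earlier in the thread.

```sh
# missing_block_devices DEV...   (hypothetical helper)
# Prints every argument that is not currently a block device.
missing_block_devices() {
  for dev in "$@"; do
    [ -b "$dev" ] || echo "$dev"
  done
}
```

For example, missing_block_devices /dev/mapper/partition_1 /dev/mapper/partition_2 prints nothing when both LUKS devices are unlocked, and prints the missing path(s) otherwise.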

If setting this system up from scratch my preference would be to do the redundancy as near to the hardware as possible and the encryption as far
away as possible. So I'd put LUKS on md0, not md0 on two LUKS devices.

Thanks,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Monday, June 22, 2026 18:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 22.06.26 um 15:30 schrieb Andy Smith:

Hi,

The lack of an mdadm.conf should not cause you any issues. It's only
really used to set non-default options, give a monitoring email address
and so on.

O.k. I added an entry to mdadm.conf anyway, it shouldn't hurt at least.

udev incrementally assembles MDADM arrays as devices appear. It does not
need any configuration to do this. In order to end up in the situation
OP is in, I can only imagine that they rebooted and only one of the
LUKS devices was set up, so md0 proceeded in a degraded fashion with
that device.

What is confusing to me is how OP had an active mdadm array member with
an event count significantly *behind* the inactive one. It makes me
think that this may have happened more than once, with different single
LUKS devices being activated each time.

The active and inactive devices definitely switched more than once. The md array most likely switched between the LUKS devices at boot several times, hence the different event counts. The device with newer data of course had the higher event count, as it was the one the data had been written to in the weeks before the latest switch. My best guess is also that something happened while the LUKS devices were being created which made md believe that one device was not intact or available.

The mdadm monitor daemon runs by default and should email you about
degraded arrays. Without any configuration that would be sending to root@localhost. OP should make sure that these emails will arrive
somewhere useful, or look into other ways of checking status of mdadm
arrays. What's happened here was likely trivial to fix at the
time of first problem but became a complete nightmare that likely
involved data loss (OP has backup of a device with unique data that
cannot be integrated).

Oh well, experience is never too expensive, a German saying goes... This issue will not bug me again as much as it did. The next time I curse, it will be due to a different issue, I am sure. :-)

OP, after sorting out the monitoring I think you need to verify that
both LUKS devices are always successfully unlocked and available at boot
so that the RAID 1 assembles fully and properly.

mdmonitor is now set up, and I tested that e-mails actually arrive. Latest news: The md array survived a first reboot today.

I think it's unlikely that you have had a hardware failure of the
underlying drives, though you should of course check your logs and
smartctl for that. Given that LUKS is in use and is the most complicated thing in your storage stack, I'd be looking into whether both LUKS
devices are being reliably created.

O.k.

If setting this system up from scratch my preference would be to do the redundancy as near to the hardware as possible and the encryption as far
away as possible. So I'd put LUKS on md0, not md0 on two LUKS devices.

Thank you very much for your advice!

Paul

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Max Nikulin@3:633/10 to All on Monday, June 22, 2026 18:40:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 22/06/2026 3:04 pm, Paul Leiber wrote:

my main suspect for the root cause of the dual-head database

I read about rollback to earlier state of filesystem when a device
supporting snapshots (LVM or filesystem) was mounted using FS UUID
instead of volume identifier. Snapshots have the same UUID as the real
device, so it is undefined what is found first on boot. Just ignore this
remark if snapshots are not supported on all stack levels of your
storage for the DB.
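
To see whether the same UUID shows up on more than one block device, something like the following sketch could be used. It reads "NAME UUID" pairs, for instance from lsblk -rno NAME,UUID, and prints every UUID that occurs more than once. The function name duplicate_uuids is made up for this example. Note that md RAID members legitimately report the array UUID on every member, so those hits are expected and not a sign of a snapshot.

```sh
# duplicate_uuids: read "NAME UUID" pairs on stdin and print each UUID that
# appears on more than one device. Lines without a UUID are ignored.
duplicate_uuids() {
  awk '
    NF >= 2 { count[$2]++ }
    END { for (u in count) if (count[u] > 1) print u }
  ' | sort
}
```

Usage would be lsblk -rno NAME,UUID | duplicate_uuids; an empty result means no UUID is shared between devices.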

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Tuesday, June 23, 2026 08:10:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-22, tomas@tuxteam.de wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

"has to be". LUKS encrypt a partition in a unique way. So 2 encrypted partitions are always different and cannot be synced.

As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).

Perhaps that is the problem, but I don't have enough information about
the installation.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Michel Verdier@3:633/10 to All on Tuesday, June 23, 2026 08:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 2026-06-22, Paul Leiber wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

As far as I understand OP, their case is the other way around (and I don't
see why it shouldn't be technically possible: a block device is a block
device is a block device, after all).

Tomas' description of my setup is correct, LUKS before RAID. It has
been working in the past, and it is working right now again. Is this
type of setup recommended? I don't know. BTRFS doesn't show any issues
with this setup.

So Tomas found your problem. It is at best useless to have
partition > LUKS > RAID array > filesystem
I cannot see how it managed to work. It supposes the 2 LUKS devices are identical,
which is nonsense. Also a small change in data gives a bigger change in
a LUKS partition, thus more to sync. I don't know enough about LUKS, but
I suppose you lose LUKS atomicity during sync.

However, my main suspect for the root cause of the dual-head database is indeed that the LUKS decryption messes with the md RAID assembly at boot,
e. g. some timing issue or race condition. The database content doesn't change
constantly, there are very few writes per day, so I'll rely on daily backups and monitor the RAID closely. There was another kernel update today, so I'll see what happens after a reboot, which probably was what triggered the issue in the past. If another issue occurs, I'll probably have a chance to find more
information in the logs now that I know what to look for. If my assumption is confirmed, I'll change the order to RAID before LUKS and restore the data from
backup. (Or I'll do it anyway out of lack of other ideas...)

You are right to suspect that. Don't wait and change it even if you can't confirm a bug. The good and safe way is
partition > RAID array > LUKS > filesystem
It should also improve performance.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Tuesday, June 23, 2026 10:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Tue, Jun 23, 2026 at 08:00:11AM +0200, Michel Verdier wrote:

On 2026-06-22, tomas@tuxteam.de wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

"has to be". LUKS encrypt a partition in a unique way. So 2 encrypted partitions are always different and cannot be synced.

I think that is wrong. You don't sync the *encrypted* partitions (how would you?) but the decrypted block layer, one level up. I don't see a reason it wouldn't work.

As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).

Perhaps that is the problem, but I don't have enough information about
the installation.

OP's initial description was (to me) so clear that I think I understood
it.
Cheers
--
tomás

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Tuesday, June 23, 2026 10:40:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Tue, Jun 23, 2026 at 08:28:50AM +0200, Michel Verdier wrote:

On 2026-06-22, Paul Leiber wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

As far as I understand OP, their case is the other way around (and I don't
see why it shouldn't be technically possible: a block device is a block
device is a block device, after all).

Tomas' description of my setup is correct, LUKS before RAID. It has
been working in the past, and it is working right now again. Is this
type of setup recommended? I don't know. BTRFS doesn't show any issues
with this setup.

So Tomas found your problem. It is at best useless to have
partition > LUKS > RAID array > filesystem

I strongly disagree here.

I cannot see how it managed to work. It supposes the 2 LUKS devices are identical,
which is nonsense. Also a small change in data gives a bigger change in
a LUKS partition, thus more to sync. I don't know enough about LUKS, but
I suppose you lose LUKS atomicity during sync.

It's not the LUKS devices that are identical; their decrypted layers are, ideally. Of
course this costs additional processing power (you have to de-/encrypt
things twice), and I don't (yet) see an advantage to this scheme, but
it is definitely feasible.
Cheers
--
t

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Tuesday, June 23, 2026 14:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 23.06.26 um 10:22 schrieb tomas@tuxteam.de:

On Tue, Jun 23, 2026 at 08:00:11AM +0200, Michel Verdier wrote:

On 2026-06-22, tomas@tuxteam.de wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

"has to be". LUKS encrypt a partition in a unique way. So 2 encrypted
partitions are always different and cannot be synced.

I think that is wrong. You don't sync the *encrypted* partitions (how would you?) but the decrypted block layer, one level up. I don't see a reason it wouldn't work.

Tomas is correct. The decrypted devices are assembled and synced, not the encrypted devices.

And my experience shows that it is not mandatory to have RAID before LUKS. My btrfs RAID1 has been running for years in this way without any issue. Right now, the md RAID1 is doing what it should be doing. (Yeah, that's right, I am watching you, md0!) David has even pointed out in another mail that it is mandatory to use LUKS before RAID in special cases:

Am 22.06.26 um 10:42 schrieb David Christensen:

When you have a layer that combines RAID, volume management, and filesystems, such as ZFS and btrfs, the stackable encryption layer must be underneath (e.g. the latter of above two I/O layering configurations).

I think this is technically correct, as btrfs is a filesystem and doesn't provide a block device that can be encrypted via LUKS, IIUC. Please correct me if I am wrong. (Come to think of it: The btrfs RAID1 was first on the disk, the md RAID1 came much later. Most likely I just transferred the way the btrfs RAID1 is set up to the md RAID1, without thinking.)

Now, am I saying that LUKS before *md* RAID is a smart setup? No, I am not. Probably there are good reasons to do it the other way round. And I still think it is likely that my issue comes from some hiccup in the way the RAID is assembled at boot from the decrypted devices. However, I recommend not jumping to conclusions before we have more data on this. And it might be beneficial to actually find out what the root cause for my issue is in order to be able to fix it. If I have been doing it this way, chances are that somebody else is doing it this way as well...

Anyway, thanks, I learned a lot again!

Paul

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Paul Leiber@3:633/10 to All on Tuesday, June 23, 2026 14:30:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

Am 22.06.26 um 10:06 schrieb David Christensen:

(...)

Is your OS on the btrfs mirror? I have found that putting the OS on a dedicated SSD makes operations, maintenance, trouble-shooting, disaster preparedness/ recovery, etc., much easier.

No, the OS is on a single, separate SSD. The two hard drives (with BTRFS and md RAID1s) are for data storage only. I was considering creating another RAID1 for the OS with a second SSD, but so far I have refrained from doing so.

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Tuesday, June 23, 2026 19:40:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/23/26 01:22, tomas@tuxteam.de wrote:

On Tue, Jun 23, 2026 at 08:00:11AM +0200, Michel Verdier wrote:

On 2026-06-22, tomas@tuxteam.de wrote:

On 6/21/26 23:18, Michel Verdier wrote:

RAID is always before LUKS : partition > RAID array > LUKS > filesystem

"Is always" means for you "should always be" or "has to be"?

"has to be". LUKS encrypt a partition in a unique way. So 2 encrypted
partitions are always different and cannot be synced.

I think that is wrong.

+1

Both configurations work, but have different performance and security considerations:

* partitions > RAID > encryption > filesystem

Will encrypt the RAID virtual block device, saving CPU cycles and requiring one passphrase and/or key.

* partitions > encryption > RAID > filesystem

Will encrypt each partition, arguably improving security but
requiring more CPU cycles and passphrases/ keys.

My SOHO file server uses ZFS, which combines RAID > filesystem. (ZFS
native encryption has issues, so I avoid it.) So, the file server must
use a variation of the above latter I/O layering configuration:

partitions > encryption > ZFS

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Tuesday, June 23, 2026 20:20:01

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Tue, Jun 23, 2026 at 10:33:01AM -0700, David Christensen wrote:
[...]

Both configurations work, but have different performance and security considerations:

* partitions > RAID > encryption > filesystem

Will encrypt the RAID virtual block device, saving CPU cycles and requiring one passphrase and/or key.

* partitions > encryption > RAID > filesystem

Will encrypt each partition, arguably improving security but requiring more CPU cycles and passphrases/ keys.

Actually it would reduce security, IMO, because the opponent would have
to find just one of both keys (the content is mirrored), thus potentially reducing the key strength by one bit. Not a big deal, granted :)
Cheers
--
tomás

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Wednesday, June 24, 2026 00:10:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/23/26 11:10, tomas@tuxteam.de wrote:

On Tue, Jun 23, 2026 at 10:33:01AM -0700, David Christensen wrote:

[...]

Both configurations work, but have different performance and security
considerations:

* partitions > RAID > encryption > filesystem

Will encrypt the RAID virtual block device, saving CPU cycles and
requiring one passphrase and/or key.

* partitions > encryption > RAID > filesystem

Will encrypt each partition, arguably improving security but requiring
more CPU cycles and passphrases/ keys.

Actually it would reduce security, IMO, because the opponent would have
to find just one of both keys (the content is mirrored), thus potentially reducing the key strength by one bit. Not a big deal, granted :)

Cheers

I agree that successfully cracking two or more disks from an encrypted
RAID will give an attacker greater confidence in the resulting data and metadata.

But I would expect a cracking algorithm for an encryption layer with
on-disk cryptographic details (e.g. LUKS header) would primarily attack
those on-disk cryptographic details:

* Assuming a brute-force cracking algorithm, each crack attempt (e.g. passphrase and/or key generated by an iterator) is an independent trial
and the work is readily partitioned across multiple computers working in parallel. So, cracking 1 LUKS header with N computers will take the
same average time as cracking any one of 2 to N different LUKS headers
with N computers.

* What an attacker wants is a cracking algorithm where each new cracking attempt leverages the results from previous failed attempts. AIUI LUKS, dm-crypt, and other professional cryptographic systems are specifically designed to thwart such. But if you design such an algorithm, you could become famous, make money, become an enemy of the state, go to prison,
flee into exile, etc.

I was thinking of what happens if a disk fails, the sysadmin disposes of
the disk, an attacker obtains the disk, and the attacker successfully
cracks the encryption. The attacker now has all or part of the
plaintext data, the plaintext metadata, and the plaintext cryptographic details at the time the disk failed:

* If encryption was applied on top of RAID and the attacker obtains a
second encrypted disk, the attacker can use the plaintext cryptographic details from the first disk to crack the second disk. This could be as
simple as entering the passphrase and/or key from the first disk.

* If encryption was applied under RAID and the sysadmin used different
strong passphrases and/or keys on every disk, the plaintext
cryptographic details from any one cracked disk will not help to crack additional encrypted disks.

David

--- PyGate Linux v1.5.17
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From tomas@3:633/10 to All on Wednesday, June 24, 2026 07:50:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On Tue, Jun 23, 2026 at 03:04:38PM -0700, David Christensen wrote:

On 6/23/26 11:10, tomas@tuxteam.de wrote:

On Tue, Jun 23, 2026 at 10:33:01AM -0700, David Christensen wrote:

[...]

Both configurations work, but have different performance and security considerations:

* partitions > RAID > encryption > filesystem

Will encrypt the RAID virtual block device, saving CPU cycles and requiring one passphrase and/or key.

* partitions > encryption > RAID > filesystem

Will encrypt each partition, arguably improving security but requiring
more CPU cycles and passphrases/keys.

Actually it would reduce security, IMO, because the opponent would have
to find just one of both keys (the content is mirrored), thus potentially reducing the key strength by one bit. Not a big deal, granted :)

Cheers

I agree that successfully cracking two or more disks from an encrypted RAID will give an attacker greater confidence in the resulting data and metadata.

No, no: I meant the attacker has to crack *just one of two*, thus
potentially halving the search time (assuming enough parallelism,
which seems sensible to assume in these crazy days we live in).

But I would expect a cracking algorithm for an encryption layer with on-disk cryptographic details (e.g. LUKS header) would primarily attack those
on-disk cryptographic details:

* Assuming a brute-force cracking algorithm, each crack attempt (e.g. passphrase and/or key generated by an iterator) is an independent trial and the work is readily partitioned across multiple computers working in parallel. So, cracking 1 LUKS header with N computers will take the same average time as cracking any one of 2 to N different LUKS headers with N computers.

Now that makes sense to me: space × time is constant, you double the
one and halve the other. You're right.

* What an attacker wants is a cracking algorithm where each new cracking attempt leverages the results from previous failed attempts. AIUI LUKS, dm-crypt, and other professional cryptographic systems are specifically designed to thwart such attacks. But if you design such an algorithm, you could become famous, make money, become an enemy of the state, go to prison, flee into exile, etc.

I'd expect that, yes. Current attacks seem to concentrate on the PBKDF,
which is why Argon2, specifically Argon2id [1] [2], is currently recommended
(it makes highly parallel attacks by SIMD GPUs difficult).
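
To illustrate why a memory-hard PBKDF slows down massively parallel guessing, here is a small Python sketch. Python's standard library has no Argon2, so this uses `hashlib.scrypt`, a different memory-hard function, purely as a stand-in. The cost parameters and the helper name `derive_key` are assumptions chosen for illustration and are not LUKS defaults.

```python
import hashlib
import os

# Illustrative scrypt cost parameters (assumptions, not LUKS settings).
N, R, P = 2**14, 8, 1

def derive_key(passphrase: bytes, salt: bytes) -> bytes:
    """Derive a 32-byte key from a passphrase with a memory-hard KDF."""
    return hashlib.scrypt(passphrase, salt=salt, n=N, r=R, p=P, dklen=32)

# scrypt needs roughly 128 * N * R bytes of working memory for every guess.
memory_per_guess = 128 * N * R

salt = os.urandom(16)
key = derive_key(b"correct horse", salt)
wrong = derive_key(b"correct horsf", salt)   # one character off

print(f"memory per guess: {memory_per_guess // 2**20} MiB")
print(f"10,000 parallel guesses: about {10_000 * memory_per_guess / 2**30:.0f} GiB")
```

Each guess has to pay the full memory cost, so an attacker running many guesses at once needs that much memory per guess, which is what hurts GPU-style parallelism.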

I was thinking of what happens if a disk fails, the sysadmin disposes of the disk, an attacker obtains the disk, and the attacker successfully cracks the encryption. The attacker now has all or part of the plaintext data, the plaintext metadata, and the plaintext cryptographic details at the time the disk failed:

Never do that. If the electronics still work well enough to dd over the
first sectors of the disk, by all means do so.
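
As a rough sketch of what overwriting the first sectors amounts to, here is the same idea in Python, demonstrated on a throwaway file rather than a real disk. The helper name `zero_leading_bytes` is hypothetical, and the 16 MiB figure is an assumption based on the usual LUKS2 default header area; the real data offset should be checked with `cryptsetup luksDump` first.

```python
import os
import tempfile

# Assumption: 16 MiB covers the default LUKS2 header area; verify the real
# data offset with `cryptsetup luksDump` before relying on this number.
HEADER_BYTES = 16 * 1024 * 1024

def zero_leading_bytes(path, length=HEADER_BYTES, chunk=1024 * 1024):
    """Overwrite the first `length` bytes of `path` with zeroes, in place.

    Returns the number of bytes written (capped at the size of the target).
    """
    zeros = bytes(chunk)
    written = 0
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)   # works for files and block devices
        f.seek(0)
        remaining = min(length, size)
        while written < remaining:
            n = min(chunk, remaining - written)
            f.write(zeros[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())            # push the zeroes out to the device
    return written

# Harmless demonstration on a temporary file instead of a real disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * 4096)
demo_path = tmp.name
demo_written = zero_leading_bytes(demo_path, length=1024)
with open(demo_path, "rb") as f:
    demo_data = f.read()
os.unlink(demo_path)
```

On a real disk the path would be the block device, the call would need root, and it is destructive, so double-check the device name before running anything like this.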

* If encryption was applied on top of RAID and the attacker obtains a second encrypted disk, the attacker can use the plaintext cryptographic details
from the first disk to crack the second disk. This could be as simple as entering the passphrase and/or key from the first disk.

* If encryption was applied under RAID and the sysadmin used different
strong passphrases and/or keys on every disk, the plaintext cryptographic details from any one cracked disk will not help to crack additional
encrypted disks.

Which you don't need to, since we are talking RAID1, and they should
have (roughly ;) equal content.

Other RAID schemata are different, granted.

Cheers
--
t

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From David Christensen@3:633/10 to All on Wednesday, June 24, 2026 19:40:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

On 6/23/26 22:39, tomas@tuxteam.de wrote:

On Tue, Jun 23, 2026 at 03:04:38PM -0700, David Christensen wrote:

I was thinking of what happens if a disk fails, the sysadmin disposes of the disk, an attacker obtains the disk, and the attacker successfully cracks the encryption. The attacker now has all or part of the plaintext data, the plaintext metadata, and the plaintext cryptographic details at the time the disk failed:

Never do that. If the electronics still work well enough to dd over the
first sectors of the disk, by all means do so.

Using software to write zeroes to a drive will get the sectors that the
drive controller allows the host to see, but HDD failed/remapped
sectors will still contain content, as will dirty SSD/USB flash sectors
waiting to be erased. If a skilled attacker obtains the drive at this
point, the remaining data could be compromised.

ATA Secure Erase is supposed to get more (all?) sectors, but I do not
know what happens with broken sectors.

I have heard of people using magnetic erasers for magnetic HDDs.

I have heard of disk shredding and/or incineration services, but that is
above my scale.

My practice has been zeroes and/or secure erase, followed by a 3 pound drilling hammer.

David

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)

From Nicolas George@3:633/10 to All on Wednesday, June 24, 2026 19:50:02

Subject: Re: How to salvage a degraded mdadm RAID1 with as little data loss as possible?

David Christensen (HE12026-06-24):

I have heard of disk shredding and/or incineration services, but that is above my scale.

I confirm. My chief organized that for our school last year. I think
they used a hydraulic press, but I was not involved in the process and
did not go near and see.

Regards,

--
Nicolas George

--- PyGate Linux v1.5.18
* Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
