Hi everybody,
I am using an mdadm software RAID1 as a dedicated MariaDB database file system. The devices used for the RAID1 are two partitions of identical
size which are LUKS encrypted. The devices are decrypted via entries
in /etc/crypttab. The resulting RAID1 is called /dev/md0, formatted as
XFS. (For completeness' sake: md0 is then forwarded to a database VM
which stores the database on the device, but that shouldn't play a role
for my questions, IIUC.)
Some time ago, I noticed that the database content changed after a
reboot. Recent changes to the databases were seemingly lost. I couldn't pinpoint the cause for this, but attributed it to an unclean shutdown of
the database prior to reboot of the database VM. Data loss in a database
of course is not ideal, so I kept on looking. It seems that I have now identified the root cause for the data loss in the RAID1.
I checked the RAID1:
root@xxx:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 dm-30[2]
      1073593280 blocks super 1.2 [2/1] [_U]
      bitmap: 7/8 pages [28KB], 65536KB chunk
The [_U] seems to indicate that the RAID1 is currently degraded and
that just one of the two partitions is currently used for the RAID1.
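For readers who want to check this automatically, here is a minimal sketch of a shell function that lists arrays whose status field (such as [_U]) contains an underscore. The function name degraded_arrays is hypothetical, and the sketch assumes the /proc/mdstat layout shown above, where the status is the last field of the line following the array name.

```shell
# List md arrays whose member-status field (e.g. [_U]) contains an
# underscore, i.e. at least one member is missing.
# Reads /proc/mdstat-style text from stdin.
# Assumption: the status field is the last field of the "blocks" line.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }      # remember the current array name
        name != "" && /\[[U_]+\]/ {      # status line belonging to that array
            if ($NF ~ /_/) print name
            name = ""
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty result would suggest that no array is degraded at the moment the file is read.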
Checking the partitions the RAID1 is based on gives the following output:
root@xxx:~# mdadm --examine /dev/dm-31
/dev/dm-31:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6834a10d:edb03a51:cef24158:f9abc812
           Name : xxx:0  (local to host xxx)
  Creation Time : Fri Nov  4 16:05:45 2022
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2147186655 sectors (1023.86 GiB 1099.36 GB)
     Array Size : 1073593280 KiB (1023.86 GiB 1099.36 GB)
  Used Dev Size : 2147186560 sectors (1023.86 GiB 1099.36 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=95 sectors
          State : clean
    Device UUID : 01c96166:ee782cc7:57bcf889:2ee53b43
Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun 17 13:17:46 2026
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : d46fa108 - correct
         Events : 5397997
   Device Role : Active device 0
   Array State : A. ('A' == active, '.' == missing, 'R' == replacing)
and
root@xxx:~# mdadm --examine /dev/dm-30
/dev/dm-30:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6834a10d:edb03a51:cef24158:f9abc812
           Name : xxx:0  (local to host xxx)
  Creation Time : Fri Nov  4 16:05:45 2022
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2147186655 sectors (1023.86 GiB 1099.36 GB)
     Array Size : 1073593280 KiB (1023.86 GiB 1099.36 GB)
  Used Dev Size : 2147186560 sectors (1023.86 GiB 1099.36 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=95 sectors
          State : clean
    Device UUID : 637fc155:8fb21b7c:fff27b71:c7ea1094
Internal Bitmap : 8 sectors from superblock
    Update Time : Fri Jun 19 21:34:19 2026
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : b110fe9d - correct
         Events : 4814810
   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing, 'R' == replacing)
It can be seen that update time and number of events differ between both partitions, which seems to indicate different data. I am assuming that
due to some circumstance (wild guess: a race condition when unlocking
the LUKS encryption), the RAID1 is more or less randomly using only one
of the partitions, which then results in differing database versions, depending on which of the two partitions is currently used.
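The comparison of the two superblocks can be scripted. The following is only a sketch: the helper names examine_field and fresher are hypothetical, and it assumes the "Key : value" layout of the mdadm --examine output shown above. As far as I understand, md treats the member with the higher event count as the more current one when assembling, but please treat that as an assumption to verify against the mdadm documentation.

```shell
# Print the value of one "Key : value" field from `mdadm --examine`
# output read on stdin, e.g.: examine_field Events
examine_field() {
    awk -F' : ' -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }   # trim the key
        k == key { print $2; exit }
    '
}

# Compare two event counters and print the name of the member with the
# higher count, or "equal" if both counters match.
fresher() {  # usage: fresher NAME_A EVENTS_A NAME_B EVENTS_B
    if [ "$2" -gt "$4" ]; then
        echo "$1"
    elif [ "$4" -gt "$2" ]; then
        echo "$3"
    else
        echo "equal"
    fi
}

# Usage on a live system (as root):
#   a=$(mdadm --examine /dev/dm-31 | examine_field Events)
#   b=$(mdadm --examine /dev/dm-30 | examine_field Events)
#   fresher dm-31 "$a" dm-30 "$b"
```

If the two counters differ, the members have diverged and should not be expected to hold the same data.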
I also think that I found a possible cause for this misbehaviour. My /etc/mdadm/mdadm.conf contains just the default settings:
# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
# This configuration was auto-generated on Fri, 04 Nov 2022 15:52:55 +0100 by mkconf
Somehow, I failed to add the RAID1 information for md0 to the configuration file (e. g. by entering root@localhost:~# mdadm --detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure whether this
actually is the cause and whether adding that information would solve the issue.
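A small sketch of how the entry could be added only once: the function name has_array_entry is hypothetical, and the check assumes that mdadm --detail --scan emits a line of the form "ARRAY /dev/md0 ... UUID=<array uuid>". The update-initramfs step is taken from the comment in the configuration file itself. Whether the missing entry is actually the cause remains open.

```shell
# Return success if FILE already contains an ARRAY line for the given
# array UUID (as printed under "Array UUID" by mdadm --examine).
has_array_entry() {  # usage: has_array_entry FILE UUID
    grep -Eq "^ARRAY[[:space:]].*UUID=$2([[:space:]]|\$)" "$1"
}

# Usage on a live system (as root), appending the entry only if missing:
#   uuid=6834a10d:edb03a51:cef24158:f9abc812
#   has_array_entry /etc/mdadm/mdadm.conf "$uuid" ||
#       mdadm --detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf
#   update-initramfs -u
```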
My questions are the following:
1. Is my analysis valid in principle? Especially: Could the root cause
for this issue be that mdadm.conf is missing the information for md0,
and could adding that information prevent data loss or inconsistencies
in the future?
2. Can I (re)create the RAID1 md0 or (re-)add the missing partition in
an easy way that no or at least not all information is lost? If yes, how?
I assume that it might not be possible to sync the data from two
different database versions without data loss. If this assumption is correct, I am willing to use one data set (e. g. the one on dm-31) and discard the other data set (e. g. the one on dm-30). Guides I found so
far describe how to set up a new RAID1 and copy the data from a
partition to the new RAID1. However, I am wondering whether it is possible to (re-)create a RAID1 using just one existing partition (e. g. dm-31) without losing the data on this partition, and then add the other partition to the RAID1.
The databases are backed up regularly. However, the backup is
incremental, and it seems that the different database versions are
messing up the incremental backup, therefore my last valid backup
doesn't include the most recent changes to the database. If it is not possible to salvage the data on one or both of the partitions, I could swallow the bitter pill and go back to a previous database state without unacceptable consequences. However, I would like to try to salvage as
much data as possible.
Thank you in advance,
Paul
AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.
David Christensen <dpchrist@holgerdanske.com> wrote:
AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.
XFS checksums metadata for many years now (12?), but it doesn't checksum
user data.
On Sat, 20 Jun 2026 18:01:15 +0200
Paul Leiber <paul@onlineschubla.de> wrote:
Somehow, I missed to include the RAID1 information for md0 to the
configuration file (e. g. by entering root@localhost:~# mdadm
--detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure if
this actually is the cause and adding that information would solve
the issue.
My questions are the following:
1. Is my analysis valid in principle? Especially: Could the root
cause for this issue be that mdadm.conf is missing the information
for md0, and could adding that information prevent data loss or
inconsistencies in the future?
I doubt that this is the culprit. The man page for mdadm says, in part:
Assemble
Assemble the components of a previously created array
into an active array. Components can be explicitly given or can
be searched for. mdadm checks that the components do form a
bona fide array, and can, on request, fiddle superblock
information so as to assemble a faulty array.
So mdadm *should* find both devices. But it might not. And adding
that line will not hurt. I have a similar line in my mdadm.conf.
I built my RAID array up a bit differently than you did yours. You made
your partitions, put LUKS on the partitions, then the RAID on top of
that. I have the partitions, then the RAID array, LUKS on top of that,
then LVM, with file systems on top of the LVs. But I know of no reason
your setup shouldn't work.
I have found that when I have multiple LUKS partitions, giving them
all the same passphrase means I need give only one passphrase to
decrypt on boot.
2. Can I (re)create the RAID1 md0 or (re-)add the missing partition
in an easy way that no or at least not all information is lost? If
yes, how?
Yes. For the gory details see https://oneuptime.com/blog/post/2026-03-02-how-to-replace-a-failed-disk-in-mdadm-raid-on-ubuntu/view.
In short,
* Fail the offending disk. It looks like this has already happened, but
it shouldn't hurt to do it again.
* Remove the disk from the array.
* Add the disk back in again. This should trigger rebuilding, which
takes a while. During the rebuild, the data should be both readable
and writable. You may monitor with:
cat /proc/mdstat
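The three steps above can be sketched as a shell function. This is a hedged sketch, not a tested procedure: the names run and readd_member and the member path in the usage line are hypothetical, and only the mdadm options --fail, --remove and --add are taken from mdadm's manage mode. Note that after --add the re-added member is resynced from the member that is currently active, so whatever the re-added member held before is overwritten.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Fail a member, remove it from the array, then add it back.
# Stops at the first step that fails.
readd_member() {  # usage: readd_member ARRAY MEMBER
    run mdadm "$1" --fail "$2" &&
    run mdadm "$1" --remove "$2" &&
    run mdadm "$1" --add "$2"
}

# Usage (as root); start with a dry run to see what would be executed:
#   DRY_RUN=1 readd_member /dev/md0 /dev/mapper/example_member
#   readd_member /dev/md0 /dev/mapper/example_member
#   cat /proc/mdstat    # watch the rebuild
```

If the member has already been dropped from the array, the --fail and --remove steps may report an error and the function stops; in that case only the --add step would be needed.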
David Christensen <dpchrist@holgerdanske.com> wrote:
AIUI neither XFS nor mdadm compute, store, or verify checksums of
data or metadata on disk.
XFS checksums metadata for many years now (12?), but it doesn't
checksum user data.
I've always thought that the hardware controller checksums raw disk
blocks (sectors) as part of the low-level I/O processing in the
controller hardware's "firmware" and that this is how the controller
knows it has a bad block.
I just noticed that I didn't manage to make clear that (1) I don't
think that there is one specific failed partition, but that both
partitions containing databases seem to work, but not at the same
time, and that (2) I want to keep the data on the seemingly failed
device.
Let me explain with an example: I am using KODI to access my video
data from different devices. A couple of months ago, I switched KODI
to using a centralized database (containing metadata information on
movies, watch status, etc.) in order to maintain only one database
instead of a database on each device running KODI. The data is
stored on the database VM, running MariaDB, which stores the data on
the md software RAID1 (at least that's what was supposed to happen).
I spent some time configuring the metadata, e.g. correcting mistakes
in the movie titles etc. I then noticed that I mistakenly selected
English language to display the movie descriptions. Because of
family members not fluent in English, I redid the metadata
configuration in German. (It was annoying work, therefore I
remember it well.) Then, some time later, after a reboot of the
hypervisor (and the database VM) due to kernel updates, the
language of the movie descriptions was displayed in English again. I attributed this to a corrupt database after the database VM reboot
and loaded a database backup from some time ago, where the movie
description was still in English. So I did the metadata
configuration a third time, again in German. (I guess you can
imagine the fun I had.) Then, a couple of days ago, after a reboot
of the hypervisor and the database VM, the KODI movie description
was displayed in English again. That's when I really started
digging, because now it was clear that there were actually two
intact, but differing databases. (To be clear: There were some other
changes to other databases that also were affected in a similar
manner which I don't mention in this example, so this issue is not restricted to the KODI database).
Based on the available data, I attribute this issue to the RAID1
which seems to select one of two partitions at random when booting
the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed
device given by mdadm --examine matches the point in time of the
switch from one database version ("German") to the other ("English"),
therefore I assume that the switch happens at the software RAID level.
- A failure at hardware level doesn't seem likely, because how could
there suddenly be an older version of a database available in a RAID1
if one device fails and the RAID1 is degraded, and this after entirely
rebuilding the database from a backup? And, mind you, this switch to an
older version of the database didn't happen just once, but at least two
times. The data (in English) simply shouldn't have been available
anymore at this point if the RAID1 had been working as intended.
The most likely explanation to me is that the RAID1 has been running
in a degraded state for some time (unnoticed by me), the database
changes (e. g. from English to German) were stored to just one of
the two partitions, and at some point the RAID1 switched to the
other partition after a reboot, containing intact, but older (e. g.
English) data. As defective hardware doesn't seem likely, I assume
that something in my setup causes this behaviour by md. But of
course, I might be wrong and I am open to other explanations. For
example, what my assumption fails to explain is why the switch only
happens from time to time, and not more often, e.g. after each
reboot.
The example you kindly give is for removing a seemingly failed
partition (currently dm-30, "German" database) from a md RAID1,
keeping the data on an intact partition (currently dm-31, "English" database) and then re-adding a partition to the RAID1. This is
pretty straightforward: the data is kept and replicated from the
valid partition to the freshly added one. However, in my case, the
dataset I want to keep is on the seemingly failed partition not used
in the RAID (currently dm-30, "German").
Options I see (besides recreating the RAID1 from scratch and using
an available backup to restore the data, losing some data):
1. I could fail the seemingly intact partition or remove the RAID1
entirely, somehow use the seemingly failed partition (dm-30,
"German") to create a new RAID without losing the data on it, then
add the other partition (dm-31) as a new drive and have the data
replicated. I am not sure if this is possible, therefore my question
to this list.
2. Another option is to reboot the hypervisor and hope for a switch
of the RAID to the partition containing the more recent version of
the database, then follow your guide. But I am not really confident
that such a "strategy" is the best choice I have at the moment.
Also, I just tried a reboot three times, each time the data in the
database is the wrong, old one.
3. I could also backup the database from the seemingly failed
partition in order to not lose data and then use this backup to
recreate the RAID1, but I would need to mount that partition, which
ended in an error when I tried it.
And, of course, I don't want this to happen again, therefore I want
to find the root cause for this situation and fix it. If it is not
the missing information in /etc/mdadm/mdadm.conf, what else could it
be?
Sorry for the lengthy posts, I don't know how to describe this
situation clearly in a shorter way.
That is also my understanding. The HDD's I own are proprietary, so the engineering documentation is unavailable. But, I seem to recall reading an article on the WWW stating that HDD's store additional data in hidden blocks on the media to allow detecting and/or correcting several flipped bits.
Much more than that. They do accept some error rate which is corrected
On 6/20/26 13:01, debian-user@howorth.org.uk wrote:
David Christensen <dpchrist@holgerdanske.com> wrote:
AIUI neither XFS nor mdadm compute, store, or verify checksums of data or metadata on disk.
XFS checksums metadata for many years now (12?), but it doesn't checksum user data.
Thank you for the clarification:
https://wiki.archlinux.org/title/XFS#Checksumming
On 6/20/26 13:26, Robert Heller wrote:
I've always thought that the hardware controller checksums raw disk blocks (sectors) as part of the low-level I/O processing in the controller hardware's "firmware" and that this is how the controller
knows it has a bad block.
That is also my understanding. The HDD's I own are proprietary, so the engineering documentation is unavailable. But, I seem to recall reading an article on the WWW stating that HDD's store additional data in hidden blocks on the media to allow detecting and/or correcting several flipped bits.
On 6/20/26 16:23, Paul Leiber wrote:
I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.
[...]
Do you own a power supply tester? If not, I suggest buying one. When you have one, test your power supply. It is possible for one rail to fail (e.g. +12 VDC), the computer to boot, and all or part of the computer to operate incorrectly. Without a power supply tester, you will be chasing seemingly random errors until the power supply goes to 100% failure and/or you damage/destroy other hardware.
Does your computer have ECC memory? If not, I suggest getting a computer with ECC memory. In any case, I suggest testing your memory with memtest86+ for 24 hours.
Have you tested your hard disks? If not, I suggest running smartctl(8) "--test long". When testing is done, view the results with "--xall".
Have you validated the filesystem with fsck.xfs(8)? If not, I suggest doing so.
Do you have streams of database transactions since the last known good backups? If so, can they be replayed?
Can you switch the databases to read-only, shutdown, disconnect the first disk, boot, backup the database(s), shutdown, connect the first disk, disconnect the second disk, boot, backup the database(s), shutdown, and connect the second disk? If so, you could then restore those backups, and the last known good backups, (with different names, read-only) and trouble-shoot. It may or may not be possible to identify the newest data and implement queries/scripts to do the three-way merge.
On 2026-06-21, Paul Leiber wrote:
I just noticed that I didn't manage to make clear that (1) I don't think that
there is one specific failed partition, but that both partitions containing
databases seem to work, but not at the same time, and that (2) I want to keep
the data on the seemingly failed device.
[...]
I have used RAID for decades and never had the problem you described. So
I believe it must come from something other than a RAID failure. RAID duplicates data on both partitions. If you don't have a failed partition,
the duplication is done transparently and quickly (<second). When a
partition fails, you can add a new, clean partition, and the original
one, the partition still in the RAID, is synchronized onto it. As others told
you, you should use smartd to monitor your disks.
Am 20.06.26 um 20:00 schrieb Charles Curley:
On Sat, 20 Jun 2026 18:01:15 +0200
Paul Leiber <paul@onlineschubla.de> wrote:
Somehow, I missed to include the RAID1 information for md0 to theI doubt that this is the culprit. the man page for mdadm says, in part:
configuration file (e. g. by entering root@localhost:~# mdadm
--detail --scan /dev/md0 >> /etc/mdadm/mdadm.conf). I am not sure if
this actually is the cause and adding that information would solve
the issue.
My questions are the following:
1. Is my analysis valid in principle? Especially: Could the root
cause for this issue be that mdadm.conf is missing the information
for md0, and could adding that information prevent data loss or
inconsistencies in the future?
˙˙˙˙˙ Assemble
˙˙˙˙˙˙˙˙˙˙˙˙˙˙ Assemble the components of a previously created array
˙˙˙˙˙ into an active array.˙ Components can be explicitly given or can
˙˙˙˙˙ be searched for.˙ mdadm checks˙ that˙ the components do form a
˙˙˙˙˙ bona fide array, and can, on request, fiddle superblock
˙˙˙˙˙ information so as to assemble a faulty array.
So mdadm *should* find both devices. But it might not be. And adding
that line will not hurt. I have a similar line in my mdadm.conf.
I built my RAID array up a bit differently that you did yours. You made
your partitions, put LUKS on the partitions, then the RAID on top of
that. I have the partitions, then the RAID array, LUKS on top of that,
then LVM, with file systems on top of the LVs. But I know of no reason
your setup shouldn't work.
I have found that when I have multiple LUKS partitions, giving them
all the same passphrase means I need give only one passphrase to
decrypt on boot.
2. Can I (re)create the RAID1 md0 or (re-)add the missing partitionYes. For the gory details see
in an easy way that no or at least not all information is lost? If
yes, how?
https://oneuptime.com/blog/post/2026-03-02-how-to-replace-a-failed-disk-in-mdadm-raid-on-ubuntu/view.
In short,
* Fail the offending disk. It looks like this has already happened, but
˙˙ it shouldn't hurt to do it again.
* Remove the disk from the array.
* Add the disk back in again. This should trigger rebuilding, which
˙˙ takes a while. During the rebuild, the data should be both readable
˙˙ and writable. You may monitor with:
˙˙ cat /proc/mdstat
I just noticed that I didn't manage to make clear that (1) I don't think that there is one specific failed partition, but that both partitions containing databases seem to work, but not at the same time, and that (2) I want to keep the data on the seemingly failed device.
Let me explain with an example: I am using KODI to access my video data from different devices. A couple of months ago, I switched KODI to using a centralized database (containing metadata information on movies, watch status, etc.) in order to maintain only one database instead of a database on each device running KODI. The data is stored on the database VM, running MariaDB, which stores the data on the md software RAID1 (at least that's what was supposed to happen). I spent some time configuring the metadata, e.g. correcting mistakes in the movie titles etc. I then noticed that I mistakenly selected English language to display the movie descriptions. Because of family members not fluent in English, I redid the metadata configuration in German. (It was an annoying work, therefore I remember it well.) Then, some time later, after a reboot of the hypervisor (and the database VM) due to kernel updates, the language of the movie descriptions was displayed in English again. I
attributed this to a corrupt database after the database VM reboot and loaded a database backup from some time ago, where the movie description was still in English. So I did the metadata configuration a third time, again in German. (I guess you can imagine the fun I had.) Then, a couple of days ago, after a reboot of the hypervisor and the database VM, the KODI movie description was displayed in English again. That's when I really started digging, because now it was clear that there were actually two intact, but differing databases. (To be clear: There were some other changes to other databases that also were affected in a similar manner which I don't mention in this example, so this issue is not restricted to the KODI database).
Based on the available data, I attribute this issue to the RAID1 which seems to select one of two partitions at random when booting the hypervisor. Indications:
- The last update time in the description of the (seemingly) failed device given by mdadm --examine match the point in time of the switch from one database version ("German") to the other ("English"), therefore I assume that the switch happens at the software RAID level.
- A failure at hardware level doesn't seem likely, because how could there suddenly be an older version of a database available in a RAID1 if one device fails and the RAID1 is degraded, and this after entirely rebuilding the database from a backup? And, mind you, this switch to an older version of the database didn't happen just once, but at least two times. The data (in English) simply shouldn't have been available anymore at this point if the RAID1 had been working as intended.
The most likely explanation to me is that the RAID1 has been running in a degraded state for some time (unnoticed by me), the database changes (e. g. from English to German) were stored to just one of the two partitions, and at some point the RAID1 switched to the other partition after a reboot, containing intact, but older (e. g. English) data. As a defective hardware doesn't seem likely, I assume that something in my setup causes this behaviour by md. But of course, I might be wrong and I am open to other explanations. For example, what my assumption fails to explain is why the switch only happens from time to time, and not more often, e.g. after each reboot.
The example you kindly give is for removing a seemingly failed partition (currently dm-30, "German" database) from a md RAID1, keeping the data on an intact partition (currently dm-31, "English" database) and than re-fgadding a partition to the RAID1. This is pretty straightforward: the data is kept and replicated from the valid partition to the freshly added one. However, in my case, the dataset I want to keep is on the seemingly failed partition not used in the RAID (currently dm-30, "German").
Options I see (besides recreating the RAID1 from scratch and using an available backup to restore the data, losing some data):
1. I could fail the seemingly intact partition or remove the RAID1 entirely, somehow use the seemingly failed partition (dm-30, "German") to create a new RAID without losing the data on it, then add the other partition (dm-31) as a new drive and have the data replicated. I am not sure if this is possible, therefore my question to this list.
2. Another option is to reboot the hypervisor and hope for a switch of the RAID to the partition containing the more recent version of the database, then follow your guide. But I am not really confident that such a "strategy" is the best choice I have at the moment. Also, I just tried a reboot three times, each time the data in the database is the wrong, old one.
3. I could also back up the database from the seemingly failed partition in order to not lose data and then use this backup to recreate the RAID1, but I would need to mount that partition, which ended in an error when I tried it.
And, of course, I don't want this to happen again, therefore I want to find the root cause for this situation and fix it. If it is not the missing information in /etc/mdadm/mdadm.conf, what else could it be?
Sorry for the lengthy posts, I don't know how to describe this situation clearly in a shorter way.
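A degraded array like the one described in this post can be spotted by checking /proc/mdstat regularly instead of after data has gone missing. The following is only a sketch, assuming the /proc/mdstat layout shown earlier in this thread; `degraded_arrays` is a made-up helper name, not an mdadm command.

```shell
#!/bin/sh
# Sketch only: print the name of every md array whose member-status field
# (for example [UU] or [_U]) contains "_", which marks a missing member.
# Input is /proc/mdstat-style text on stdin.
# degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }   # remember which array is being described
        /\[[U_]+\]/ {                 # line carrying the member-status field
            if (name != "" && match($0, /\[[U_]+\]/)) {
                if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
                name = ""
            }
        }
    '
}

# Usage on a live system (assumption: run on the host that owns the array):
#   degraded_arrays < /proc/mdstat
```

An empty result means no array reported a missing member at that moment; it says nothing about whether the members hold identical data.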
My knowledge of IT is limited. I can only describe what I observe and make guesses. (The md RAID is part of a setup I run for fun at home.) I know that it sounds strange, but my best guess is that there are two differing databases stored on my hard drives. How else can the repeated switch between different data sets be explained?
I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in use) in order not to lose data. Then, I recreated the RAID1 using the data from partition_1:
mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using partition_1; a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1; the contents of partition_1 are replicated to partition_2 automatically
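Before assembling from a single member as above, it may help to confirm which member really holds the newer data by comparing the "Events" counters that mdadm --examine prints. This is a sketch under assumptions: it expects the "Events : N" line of version 1.2 metadata, and `events_of` and `newer_member` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: extract the "Events" counter from `mdadm --examine` output
# given on stdin. Assumes a line of the form "   Events : 4711".
events_of() {
    awk -F: '/^[ \t]*Events[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print the label of the member with the higher event count.
# Arguments: label1 count1 label2 count2 (labels are free-form text).
newer_member() {
    if [ "$2" -gt "$4" ]; then
        echo "$1"
    elif [ "$4" -gt "$2" ]; then
        echo "$3"
    else
        echo "equal"
    fi
}

# Usage (device names taken from this thread, run as root):
#   a=$(mdadm --examine /dev/mapper/partition_1 | events_of)
#   b=$(mdadm --examine /dev/mapper/partition_2 | events_of)
#   newer_member partition_1 "$a" partition_2 "$b"
```

A higher event count only says which member md updated more often; it is still worth checking the actual data before overwriting the other member.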
On 2026-06-21, Paul Leiber wrote:
> I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in usage) in order not to lose data. Then, I recreated the RAID1 using the data from partition_1:
> mdadm --stop /dev/md0 # This stops the degraded RAID1
> mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using the partition_1, a new array UUID is required in order for --assembly to work
> mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1, contents of partition_1 are replicated to partition_2 automatically
Your commands are strange: the partitions should be the disk partitions from /dev and not mapped ones. Or do you have another layer? Where do /dev/mapper/partition_1 and /dev/mapper/partition_2 come from?
If I have been following along, the RAID parts are LUKS encrypted devices,
so to me it does make sense.
On 2026-06-22, tomas@tuxteam.de wrote:
> If I have been following along, the RAID parts are LUKS encrypted devices, so to me it does make sense.
RAID is always before LUKS: partition > RAID array > LUKS > filesystem
> Your commands are strange: the partitions should be the disk partitions from /dev and not mapped ones. Or you have another layer? From where come /dev/mapper/partition_1 and /dev/mapper/partition_2?
The partitions are LUKS encrypted and hence decrypted before being assembled into the RAID array. Thus the mapped drives.
> Beside this it is much quicker and safer to go this way:
> - do not stop the md array (and thus the assemble is not needed)
> - remove the failed partition
>   mdadm --manage /dev/md0 --remove "failed partition"
> - add the new clean partition
>   mdadm --manage /dev/md0 --add "good partition"
> - and let mdadm sync the array
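The remove-and-add sequence suggested here can be written down as a dry run first, so that the array and member names can be reviewed before anything is changed. This is a sketch, not a tested recovery procedure: `plan_readd` is a hypothetical helper that only prints commands, and the initial --fail step and the final --wait are additions that are not part of the suggestion above (marking a member failed is only needed if it is still active in the array).

```shell
#!/bin/sh
# Sketch only: print, but do not run, the mdadm commands for dropping a
# stale member from a running array and adding it back so md resyncs it.
# plan_readd is a hypothetical helper; review its output before running it.
plan_readd() {
    array=$1
    member=$2
    # Assumption: mark the member failed first in case it is still active.
    echo "mdadm --manage $array --fail $member"
    echo "mdadm --manage $array --remove $member"
    echo "mdadm --manage $array --add $member"
    # Assumption: block until the resync has finished.
    echo "mdadm --wait $array"
}

# Usage (names taken from this thread):
#   plan_readd /dev/md0 /dev/mapper/partition_2
```

The member passed in is the one whose contents will be overwritten by the resync, so it must be the copy that is not needed any more.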
On Mon, Jun 22, 2026 at 08:18:45AM +0200, Michel Verdier wrote:
> On 2026-06-22, tomas@tuxteam.de wrote:
> > If I have been following along, the RAID parts are LUKS encrypted devices, so to me it does make sense.
> RAID is always before LUKS: partition > RAID array > LUKS > filesystem
"Is always" means for you "should always be" or "has to be"?
As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).
On 21.06.26 at 06:47, David Christensen wrote:
> Do you own a power supply tester?
... I will put a test of the power supply on the list of things to try.
> Does your computer have ECC memory?
Yes, it does have ECC memory.
I will put the memory test on the list as well.
> Have you tested your hard disks?
Both disks are monitored via smartctl. Automated short and long tests are being done regularly. There are no indications of hardware failure in the SMART data.
> Have you validated the filesystem with fsck.xfs(8)?
I just did a check of the file system (using xfs_repair -n), with no errors reported.
> Do you have streams of database transactions since the last known good backups? If so, can they be replayed?
Not exactly knowing what such a stream is, I guess I don't have one. But I am not sure. Will check.
> Can you switch the databases to read-only, shutdown, disconnect the first disk, boot, backup the database(s), shutdown, connect the first disk, disconnect the second disk, boot, backup the database(s), shutdown, and connect the second disk?
That's a good suggestion. I will need to check what booting with just one disk could do to the BTRFS filesystem, but this might be a way to force md to use the disk which is currently indicated as failed.
I managed to rebuild the md RAID1 using the data on the seemingly failed device (partition_1). First, I did a dd dump of partition_2 (currently in use) in order not to lose data. Then, I recreated the RAID1 using the data from partition_1:
mdadm --stop /dev/md0 # This stops the degraded RAID1
mdadm --assemble --update=uuid /dev/md0 /dev/mapper/partition_1 # This creates a new RAID1 using partition_1; a new array UUID is required in order for --assemble to work
mdadm --manage --add /dev/md0 /dev/mapper/partition_2 # This adds partition_2 to the RAID1; the contents of partition_1 are replicated to partition_2 automatically
md is currently replicating the data (German movie descriptions in
Kodi, yay!) from partition_1 to partition_2. I might have to turn partition_2 from "spare" to "active", but I'll let the replication
complete first.
In any case, I set up mdmonitor to alert me if the RAID1 degrades
again. That's something I should have thought of earlier.
We'll see if this issue occurs again. I'll give an update if this is
the case.
Thanks to everybody for trying to help me!
Paul
> Hi,
> The lack of an mdadm.conf should not cause you any issues. It's only really used to set non-default options, give a monitoring email address and so on.
> udev incrementally assembles MDADM arrays as devices appear. It does not need any configuration to do this. In order to end up in the situation OP is in, I can only imagine that they rebooted and only one of the LUKS devices was set up, so md0 proceeded in a degraded fashion with that device.
> What is confusing to me is how OP had an active mdadm array member with an event count significantly *behind* the inactive one. It makes me think that this may have happened more than once, with different single LUKS devices being activated each time.
The switch of active and inactive devices happened definitely more than once. The md array most likely switched between the LUKS devices at boot several times, therefore the different event counts. The device with newer data of course had the higher event count, as it was the one the data had been written on in the weeks before the latest switch. My best guess is also that something happened while the LUKS devices have been created which made md believe that one device is not intact or available.
> The mdadm monitor daemon runs by default and should email you about degraded arrays. Without any configuration that would be sending to root@localhost. OP should make sure that these emails will arrive somewhere useful, or look into other ways of checking status of mdadm arrays. What's happened here was likely trivial to fix at the time of first problem but became a complete nightmare that likely involved data loss (OP has backup of a device with unique data that cannot be integrated).
> OP, after sorting out the monitoring I think you need to verify that both LUKS devices are always successfully unlocked and available at boot so that the RAID 1 assembles fully and properly.
mdmon is now set up, I tested that e-mails actually arrive.
Latest news: The md array survived a first reboot today.
> I think it's unlikely that you have had a hardware failure of the underlying drives, though you should of course check your logs and smartctl for that. Given that LUKS is in use and is the most complicated thing in your storage stack, I'd be looking into whether both LUKS devices are being reliably created.
O.k.
> If setting this system up from scratch my preference would be to do the redundancy as near to the hardware as possible and the encryption as far away as possible. So I'd put LUKS on md0, not md0 on two LUKS devices.
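The advice to verify that both LUKS devices are available could be turned into a small check that runs after boot. This is a minimal sketch: `missing_members` is a hypothetical helper, and the /dev/mapper names in the usage line are the ones used earlier in this thread. It only tests that the paths exist; on a real system a block-device test ([ -b ]) would be stricter.

```shell
#!/bin/sh
# Sketch only: print every path from the argument list that does not exist.
# missing_members is a hypothetical helper name.
missing_members() {
    for dev in "$@"; do
        [ -e "$dev" ] || echo "$dev"
    done
}

# Usage (assumption: both mapped devices should exist once crypttab
# processing has finished):
#   missing_members /dev/mapper/partition_1 /dev/mapper/partition_2
```

If this prints anything shortly after boot, one LUKS device was not unlocked in time, which would be consistent with md0 starting degraded on the other one.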
> > RAID is always before LUKS: partition > RAID array > LUKS > filesystem
> "Is always" means for you "should always be" or "has to be"?
> As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).
Tomas' description of my setup is correct, LUKS before RAID. It has
been working in the past, and it is working right now again. Is this
type of setup recommended? I don't know. BTRFS doesn't show any issues
with this setup.
However, my main suspect for the root cause of the dual-head database is indeed that the LUKS decryption interferes with the md RAID assembly at boot, e.g. through some timing issue or race condition. The database content doesn't change constantly; there are very few writes per day, so I'll rely on daily backups and monitor the RAID closely. There was another kernel update today, so I'll see what happens after a reboot, which is probably what triggered the issue in the past. If another issue occurs, I'll probably have a chance to find more information in the logs now that I know what to look for. If my assumption is confirmed, I'll change the order to RAID before LUKS and restore the data from backup. (Or I'll do it anyway for lack of other ideas...)
On Tue, Jun 23, 2026 at 08:00:11AM +0200, Michel Verdier wrote:
> On 2026-06-22, tomas@tuxteam.de wrote:
> > > RAID is always before LUKS: partition > RAID array > LUKS > filesystem
> > "Is always" means for you "should always be" or "has to be"?
> "has to be". LUKS encrypt a partition in a unique way. So 2 encrypted partitions are always different and cannot be synced.
I think that is wrong. You don't sync the *encrypted* partitions (how would you?) but the decrypted block layer, one level up. I don't see a reason it wouldn't work.
> > As far as I understand OP, their case is the other way around (and I don't see why it shouldn't be technically possible: a block device is a block device is a block device, after all).
> Perhaps the problem but I don't have enough informations on its installation.
OP's initial description was (to me) so clear that I think I understood
> On 2026-06-22, Paul Leiber wrote:
> > Tomas' description of my setup is correct, LUKS before RAID. It has been working in the past, and it is working right now again. Is this type of setup recommended? I don't know. BTRFS doesn't show any issues with this setup.
> So Tomas found your problem. It is at best useless to have
> partition > LUKS > RAID array > filesystem
I strongly disagree here.
> I cannot see how it managed to work. It supposes the 2 LUKS are identical, which is nonsense. Also a small change in data gives a bigger change in a LUKS partition, thus more to sync. I don't know enough about LUKS but I suppose you lose LUKS atomicity during sync.
Not the LUKS are identical. Their decrypted layers are, ideally. Of
On Tue, Jun 23, 2026 at 08:00:11AM +0200, Michel Verdier wrote:
On 2026-06-22, tomas@tuxteam.de wrote:
On 6/21/26 23:18, Michel Verdier wrote:
RAID is always before LUKS: partition > RAID array > LUKS > filesystem
"Is always" means for you "should always be" or "has to be"?
"has to be". LUKS encrypt a partition in a unique way. So 2 encrypted partitions are always different and cannot be synced.
I think that is wrong.
Both configurations work, but have different performance and security considerations:
* partitions > RAID > encryption > filesystem
  Will encrypt the RAID virtual block device, saving CPU cycles and requiring one passphrase and/or key.
* partitions > encryption > RAID > filesystem
  Will encrypt each partition, arguably improving security but requiring more CPU cycles and passphrases/keys.
When you have a layer that combines RAID, volume management, and filesystems, such as ZFS and btrfs, the stackable encryption layer must be underneath (e.g. the latter of the above two I/O layering configurations).
Is your OS on the btrfs mirror? I have found that putting the OS on a dedicated SSD makes operations, maintenance, trouble-shooting, disaster preparedness/recovery, etc., much easier.
On Tue, Jun 23, 2026 at 10:33:01AM -0700, David Christensen wrote:
> [...]
> Both configurations work, but have different performance and security considerations:
> * partitions > RAID > encryption > filesystem
>   Will encrypt the RAID virtual block device, saving CPU cycles and requiring one passphrase and/or key.
> * partitions > encryption > RAID > filesystem
>   Will encrypt each partition, arguably improving security but requiring more CPU cycles and passphrases/keys.
Actually it would reduce security, IMO, because the opponent would have to find just one of both keys (the content is mirrored), thus potentially reducing the key strength by one bit. Not a big deal, granted :)
Cheers
On Tue, Jun 23, 2026 at 03:04:38PM -0700, David Christensen wrote:
> On 6/23/26 11:10, tomas@tuxteam.de wrote:
> > Actually it would reduce security, IMO, because the opponent would have to find just one of both keys (the content is mirrored), thus potentially reducing the key strength by one bit. Not a big deal, granted :)
> I agree that successfully cracking two or more disks from an encrypted RAID will give an attacker greater confidence in the resulting data and metadata.
No, no: I meant the attacker has to crack *just one of two*, thus
> But I would expect a cracking algorithm for an encryption layer with on-disk cryptographic details (e.g. LUKS header) would primarily attack those on-disk cryptographic details:
> * Assuming a brute-force cracking algorithm, each crack attempt (e.g. passphrase and/or key generated by an iterator) is an independent trial and the work is readily partitioned across multiple computers working in parallel. So, cracking 1 LUKS header with N computers will take the same average time as cracking any one of 2 to N different LUKS headers with N computers.
Now that makes sense to me: space × time is constant, you double the
> * What an attacker wants is a cracking algorithm where each new cracking attempt leverages the results from previous failed attempts. AIUI LUKS, dm-crypt, and other professional cryptographic systems are specifically designed to thwart such. But if you design such an algorithm, you could become famous, make money, become an enemy of the state, go to prison, flee into exile, etc.
I'd expect that, yes. Current attacks seem to concentrate on the PBKDF,
> I was thinking of what happens if a disk fails, the sysadmin disposes of the disk, an attacker obtains the disk, and the attacker successfully cracks the encryption. The attacker now has all or part of the plaintext data, the plaintext metadata, and the plaintext cryptographic details at the time the disk failed:
Never do that. If the electronics still work to dd to the first sectors
> * If encryption was applied on top of RAID and the attacker obtains a second encrypted disk, the attacker can use the plaintext cryptographic details from the first disk to crack the second disk. This could be as simple as entering the passphrase and/or key from the first disk.
Which you don't need to, since we are talking RAID1, and they should
> * If encryption was applied under RAID and the sysadmin used different strong passphrases and/or keys on every disk, the plaintext cryptographic details from any one cracked disk will not help to crack additional encrypted disks.
> On Tue, Jun 23, 2026 at 03:04:38PM -0700, David Christensen wrote:
> > I was thinking of what happens if a disk fails, the sysadmin disposes of the disk, an attacker obtains the disk, and the attacker successfully cracks the encryption. The attacker now has all or part of the plaintext data, the plaintext metadata, and the plaintext cryptographic details at the time the disk failed:
> Never do that. If the electronics still work to dd to the first sectors of the disk, by all means, do.
I have heard of disk shredding and/or incineration services, but that is above my scale.