Debugging a Rocky Linux Boot Nightmare: When GRUB Hides Configuration in Unexpected Places

A deep dive into troubleshooting persistent GRUB boot issues after SSD migration

Background: The Migration Process

My Rocky Linux 9.4 server was running on a traditional HDD, and I decided to upgrade to a faster NVMe SSD. Rather than doing a clean install, I wanted to migrate the existing system to preserve all configurations and data.

How I Migrated with Claude Code’s Help

I used Claude Code to guide me through the migration process. Here’s what we did:

Step 1: Partition the new SSD

Claude helped me create the partition scheme on the new NVMe drive:

# Create GPT partition table
sudo parted /dev/nvme0n1 mklabel gpt

# Create EFI partition (512MB)
sudo parted /dev/nvme0n1 mkpart primary fat32 1MiB 513MiB
sudo parted /dev/nvme0n1 set 1 esp on

# Create swap partition (16GB)
sudo parted /dev/nvme0n1 mkpart primary linux-swap 513MiB 16.5GiB

# Create root partition (remaining space)
sudo parted /dev/nvme0n1 mkpart primary ext4 16.5GiB 100%

Step 2: Format the partitions

# Format EFI partition
sudo mkfs.vfat -F32 /dev/nvme0n1p1

# Format swap
sudo mkswap /dev/nvme0n1p2

# Format root partition
sudo mkfs.ext4 /dev/nvme0n1p3

Step 3: Use rsync to copy the system

This was the critical part. Claude helped me craft the right rsync command to copy everything while preserving permissions, attributes, and excluding unnecessary directories:

# Mount the new SSD
sudo mkdir /mnt/newssd
sudo mount /dev/nvme0n1p3 /mnt/newssd

# Mount EFI partition
# (-p is needed: /boot does not exist yet on the freshly formatted root)
sudo mkdir -p /mnt/newssd/boot/efi
sudo mount /dev/nvme0n1p1 /mnt/newssd/boot/efi

# Use rsync to copy the entire system
sudo rsync -aAXHv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} / /mnt/newssd/

The rsync flags used:

  • -a: Archive mode (preserves permissions, timestamps, symbolic links, etc.)
  • -A: Preserve ACLs (Access Control Lists)
  • -X: Preserve extended attributes
  • -H: Preserve hard links
  • -v: Verbose output

Step 4: Update fstab and reinstall GRUB

# Chroot into the new system
sudo mount --bind /dev /mnt/newssd/dev
sudo mount --bind /proc /mnt/newssd/proc
sudo mount --bind /sys /mnt/newssd/sys
sudo chroot /mnt/newssd

# Get new UUIDs
blkid

# Update /etc/fstab with new UUIDs
# (Claude helped me edit this correctly)

# Reinstall GRUB to the new disk
grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=rocky /dev/nvme0n1

# Regenerate GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
grub2-mkconfig -o /boot/efi/EFI/rocky/grub.cfg

# Exit chroot and reboot
exit
sudo reboot
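
Since this step depends on /etc/fstab matching the new partitions, a cross-check before rebooting can catch a stale UUID. The following is a minimal sketch and was not part of the original migration; the helper names fstab_uuids and missing_uuids and the /tmp/known-uuids path are made up for illustration.

```shell
# Print every UUID referenced by a UUID= source field in an fstab-style file.
# Comment lines and blank lines are skipped.
fstab_uuids() {
    awk '!/^[[:space:]]*#/ && $1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$1"
}

# Print the fstab UUIDs that are NOT in a list of known UUIDs
# (file $2, one UUID per line). Empty output means every entry matched.
missing_uuids() {
    fstab_uuids "$1" | grep -vxFf "$2" || true
}

# Hypothetical usage inside the chroot:
#   blkid -s UUID -o value > /tmp/known-uuids
#   missing_uuids /etc/fstab /tmp/known-uuids
```

Any UUID this prints is referenced by fstab but not present on any attached block device, which is worth resolving before the first reboot.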

What went wrong: Despite following all these steps correctly, the system still tried to boot with the old HDD’s UUID. That’s where the real troubleshooting began.

The Problem

After the migration, I encountered what seemed like a straightforward GRUB configuration issue. The system would boot, but only after manually editing the boot parameters at the GRUB menu every single time. The error? GRUB was trying to use the old HDD’s UUID instead of the new SSD’s UUID.

Symptoms:

  • Had to press ‘e’ at GRUB menu and manually remove the old UUID
  • Boot would fail or hang in dracut emergency shell without manual intervention
  • System showed root=UUID=bf6b9071-263a-49ec-bb6a-721da27a4d8c (old HDD) instead of the correct root=UUID=8d4ad011-ae38-4c14-9589-975b4bae5405 (new SSD)
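
To confirm which root= value a given boot actually used (for example after editing the entry by hand at the menu), the kernel command line can be read back from the running system. This is a small sketch; cmdline_root is a hypothetical helper name, not an existing tool.

```shell
# Extract the value of the root= argument from a kernel command line file.
# Defaults to /proc/cmdline, the command line of the running kernel.
# If root= appears more than once, the last occurrence is printed.
cmdline_root() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^root=//p' | tail -n 1
}

# Hypothetical usage: compare against the filesystem mounted at /
#   cmdline_root            # what the kernel was told
#   findmnt -no UUID /      # what is actually mounted
```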

Seems simple enough to fix, right? Just update the configuration files. Wrong.

The Investigation: A Journey Through GRUB’s Labyrinth

Phase 1: The Obvious Fixes (That Didn’t Work)

I started with the standard GRUB troubleshooting steps:

# Remove resume parameter from grub defaults
sudo sed -i 's/ resume=UUID=[^ ]*//g' /etc/default/grub
sudo grubby --update-kernel=ALL --remove-args='resume'

# Regenerate GRUB configuration
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# Regenerate initramfs
sudo dracut --force --regenerate-all

Result: No change. GRUB menu still showed the old UUID.

Phase 2: Deep Dive into BLS Entries

Rocky Linux uses Boot Loader Specification (BLS) entries stored in /boot/loader/entries/. I verified all entries:

grep '^options' /boot/loader/entries/*.conf

Everything looked correct! Each entry showed the proper UUID. So I tried hardcoding the values:

# Hardcode correct root UUID in all BLS entries
sudo sed -i 's|^options .*|options root=UUID=8d4ad011-ae38-4c14-9589-975b4bae5405 ro rhgb quiet|' \
  /boot/loader/entries/*.conf

Result: Still no change after reboot.

Phase 3: The grubenv Investigation

GRUB uses grubenv files to store boot environment variables. I updated both locations:

# Update main grubenv
sudo grub2-editenv /boot/grub2/grubenv set \
   'kernelopts=root=UUID=8d4ad011-ae38-4c14-9589-975b4bae5405 ro rhgb quiet'

# Copy to EFI partition
sudo cp /boot/grub2/grubenv /boot/efi/EFI/rocky/grubenv

Result: Old UUID persisted in GRUB menu.

Phase 4: The Mystery Deepens

At this point, I performed an exhaustive search:

# Search ALL files in /boot for the old UUID
sudo find /boot -type f -exec grep -l 'bf6b9071-263a-49ec-bb6a-721da27a4d8c' {} \;

Finding: The old UUID appeared in initramfs images (kdump configuration), but that shouldn’t affect GRUB’s boot menu.

I even found and deleted an old backup file:

sudo rm /boot/efi/EFI/rocky/grub.cfg.rpmsave.OLD

Result: Still no change.

The Breakthrough: Hidden Configuration Directories

After hours of troubleshooting, I discovered something unexpected. The OS migration had created multiple GRUB configuration directories scattered across the filesystem:

Discovery #1: /grub2/ in the Root Directory

$ ls /grub2/
grub.cfg grubenv

$ cat /grub2/grub.cfg | grep kernelopts
set kernelopts="root=UUID=bf6b9071-263a-49ec-bb6a-721da27a4d8c..."

There it was! A complete GRUB configuration directory sitting at /grub2/ (not /boot/grub2/). This was a leftover from the migration.

Discovery #2: The Real Culprit – /loader/entries/

Here’s the smoking gun:

$ ls -la / | grep loader
drwxr-xr-x.   3 root root  4096 Jan  5 14:23 loader

$ grep '^options' /loader/entries/*.conf
options root=UUID=bf6b9071-263a-49ec-bb6a-721da27a4d8c resume=UUID=b970092e-0ff0-4b19-a7e7-1022b2205448 ro

This was the primary source GRUB was using!

When GRUB’s blscfg command searches for BLS entries, it checks multiple locations. Depending on the $prefix variable in the GRUB core image, it may find /loader/entries/ before /boot/loader/entries/.
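
Because more than one entries directory can exist, it helps to print the root= argument of every entry in every candidate location side by side. The sketch below assumes the two directories from this post; bls_roots is a hypothetical helper name, and entries that use a variable such as $kernelopts instead of a literal root= will be reported as having no root= argument.

```shell
# For each BLS entry directory given, print "<file>: <root= argument>" for
# every *.conf entry. Directories that do not exist are skipped silently.
bls_roots() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for entry in "$dir"/*.conf; do
            [ -f "$entry" ] || continue
            # Take the options line, split it into words, keep the root= word.
            root=$(sed -n 's/^options[[:space:]]\{1,\}//p' "$entry" \
                | tr ' ' '\n' | sed -n 's/^root=//p' | tail -n 1)
            printf '%s: %s\n' "$entry" "${root:-<no root= argument>}"
        done
    done
}

# Hypothetical usage:
#   bls_roots /boot/loader/entries /loader/entries
```

If the two directories disagree, the one showing the UUID that appears in the GRUB menu is the one GRUB is reading.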

The Fix

Once I found the real source, the fix was straightforward:

# Fix the REAL BLS entries location
sudo find /loader/entries/ -name '*.conf' \
   -exec sed -i 's/bf6b9071-263a-49ec-bb6a-721da27a4d8c/8d4ad011-ae38-4c14-9589-975b4bae5405/g' {} \;

# Remove problematic resume parameter
sudo find /loader/entries/ -name '*.conf' \
   -exec sed -i 's/ resume=UUID=[^ \"$]*//g' {} \;

# Also fix the other discovered location
sudo sed -i 's/bf6b9071-263a-49ec-bb6a-721da27a4d8c/8d4ad011-ae38-4c14-9589-975b4bae5405/g' /grub2/grub.cfg
sudo sed -i 's/ resume=UUID=[^ \"]*//g' /grub2/grub.cfg
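
In hindsight, in-place sed edits on boot files deserve a backup and a verification step. Here is a more cautious sketch of the same replacement; swap_uuid and its backup directory argument are hypothetical, and it assumes the old and new values are plain UUIDs (hex digits and dashes), so they are safe to drop into a sed pattern.

```shell
# Replace OLD with NEW in every *.conf file under DIR, keeping an untouched
# copy of each file in BACKUP_DIR first. Returns non-zero if OLD still
# appears anywhere under DIR afterwards.
swap_uuid() {
    dir=$1 old=$2 new=$3 backup_dir=$4
    mkdir -p "$backup_dir" || return 1
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        cp -p "$f" "$backup_dir/" || return 1
        # Rewrite from the backup copy so the original file keeps its
        # inode and permissions.
        sed "s/$old/$new/g" "$backup_dir/${f##*/}" > "$f" || return 1
    done
    ! grep -rqF -e "$old" "$dir"
}

# Hypothetical usage (as root):
#   swap_uuid /loader/entries OLD_UUID NEW_UUID /root/bls-backup
```

Keeping the backups outside the entries directory avoids leaving extra files next to the live entries.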

Result: Success! System now boots without manual intervention.

Bonus Issue: The GUI Problem

After fixing the boot issue, I encountered another problem: GDM (GNOME Display Manager) showed “Oh no! Something has gone wrong” and refused to start.

Error in logs:

gnome-shell: symbol lookup error: /lib64/libmutter-8.so.0: undefined symbol: drmModeCloseFB

Root cause: The SSD migration had left libdrm outdated while Mesa/Mutter were updated.

# Check the installed version
rpm -q libdrm
# Output: libdrm-2.4.117-1.el9.x86_64 (too old)

# Fix
sudo dnf update libdrm -y
# Upgraded to libdrm-2.4.123

sudo systemctl restart gdm

GUI immediately started working.

Lessons Learned

  1. OS migrations create configuration sprawl: When migrating systems, old configuration files can end up in unexpected locations outside the standard paths.
  2. GRUB’s search path is complex: BLS configurations can exist in multiple locations (/loader/entries/, /boot/loader/entries/, etc.) and GRUB may prioritize them differently than expected.
  3. Always check the root directory: After a migration, don’t just check /boot/ – orphaned configuration directories like /grub2/ and /loader/ can interfere with the boot process.
  4. Library version mismatches matter: Graphics stack components (libdrm, mesa, mutter) need to stay in sync. Update all components together.
  5. Methodical searching pays off: Using find and grep to exhaustively search the filesystem eventually revealed the hidden configuration sources.

Key Takeaways for Troubleshooting GRUB

If you’re experiencing persistent GRUB issues after an OS migration:

  1. Check all possible configuration locations:
     find / -name "grub.cfg" 2>/dev/null
     find / -name "grubenv" 2>/dev/null
     find / -type d -name "loader" 2>/dev/null
  2. Search for old UUIDs everywhere:
     sudo grep -r "OLD_UUID" / 2>/dev/null | grep -v /proc | grep -v /sys
  3. Don’t assume standard paths – migrations create chaos
  4. Test each fix with a reboot: only the GRUB menu shows which configuration was actually read
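
The search steps above can be rolled into one pass. This is a sketch under a few assumptions: find_boot_configs is a hypothetical name, the pruned directories are the usual virtual filesystems, and the optional second argument is the old UUID to look for.

```shell
# List every grub.cfg, grubenv and "loader" directory under ROOT (default /),
# skipping proc, sys, dev and run. If a UUID is given as the second
# argument, mark each hit whose contents mention it.
find_boot_configs() {
    root=${1:-/}
    uuid=${2:-}
    root=${root%/}    # "/" becomes "", so "$root/proc" is "/proc"
    find "${root:-/}" \
        \( -path "$root/proc" -o -path "$root/sys" \
           -o -path "$root/dev" -o -path "$root/run" \) -prune \
        -o \( -name grub.cfg -o -name grubenv -o -type d -name loader \) \
        -print 2>/dev/null |
    while IFS= read -r path; do
        if [ -n "$uuid" ] && grep -rqF -e "$uuid" "$path" 2>/dev/null; then
            printf '%s  <-- contains %s\n' "$path" "$uuid"
        else
            printf '%s\n' "$path"
        fi
    done
}

# Hypothetical usage (as root):
#   find_boot_configs / OLD_UUID
```

Anything it lists outside /boot and /boot/efi deserves a second look after a migration.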

Conclusion

What started as a “simple” UUID update turned into a multi-hour investigation through GRUB’s configuration hierarchy. The root cause wasn’t incorrect configuration in the expected locations – it was hidden configuration in unexpected places created during the OS migration.

If you’re facing similar issues, I hope this post saves you some troubleshooting time. The key is persistence and methodical searching. The configuration causing your problem is somewhere on the filesystem – you just need to find it.


System Details:

  • Boot: UEFI with GRUB2 using BLS
  • Migration: HDD to NVMe SSD
  • Issue Duration: ~4 hours across multiple troubleshooting sessions
  • Final Status: Fully resolved

Questions or similar experiences? Feel free to reach out!

