Case files of a TSE: How do you repair an rpm database? Very, very carefully.
This is an article which is part of a series that attempts to showcase the kind of work that SUSE Support does and how we help customers resolve issues they encounter when running SUSE products. The cases that are selected will be based on real cases. However, all details will be fully anonymized and stripped of identifying marks.
Some problems are very hard to identify, but the steps to solve them are quite simple. Other problems are relatively easy to identify, but are not trivial to solve. Here, I will share a problem that seems quite simple on the surface, but hides many complexities that need to be taken care of.
The problem is simple
One of our European financial services customers had something unpleasant happen to them. After a patch of their lvm2 package, the system could not boot. The customer did the sensible thing and booted into a backup of the root filesystem that they had on hand. They then raised a ticket with SUSE Support to make sure this wouldn't happen again. When I opened the supportconfig file, I saw something like this:
mrs-geeko:~ # rpm -qa | grep lvm2 | grep -v grep
lvm2-2.02.180-9.54.1.x86_64
lvm2-2.02.188-12.3.1.x86_64
lvm2-2.02.180-9.34.8.x86_64
lvm2-2.02.180-9.51.1.x86_64
lvm2-2.02.180-9.44.1.x86_64
Certainly not the way things should be. What we think happened is that the boot was broken by these multiple conflicting packages. Why this happened was not our focus, and neither was it the customer's. They wanted those superfluous packages gone.
Not everything goes to plan
An easy solution presents itself. Of course, the first thing the customer tried was to run a zypper up. This should remove the old packages and install the latest, no? It did not work. The customer then tried to zypper rm the packages by name. That did not work either.
Well, we were running out of options. A senior colleague who was keeping an eye on this case and I both suspected rpm database corruption from the very beginning. There are a number of reasons an rpm database can become corrupt, but again: our focus wasn't why this happened in the first place, but how to get out of the situation and make the system safe.
When the customer came back with the full logs, I noticed this in the zypper logs:
2021-10-03 13:24:32 mrs-geeko [zypp] Exception.cc(log):166 rpmdb2solv: inconsistent rpm database, key 1690 not found. run 'rpm --rebuilddb' to fix.
Aha! What an easy, easy fix. We simply run rpm --rebuilddb to fix it. It says so right there in the logs!
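In the abstract, the step the log suggests would look something like the sketch below. This is a hedged outline, not the exact procedure from the case: the database path /var/lib/rpm is the traditional location on SLES (newer releases may keep it under /usr/lib/sysimage/rpm), and the plan is printed for review rather than executed, since a rebuild should only be run as root after a backup.

```shell
# A minimal sketch of a cautious rebuild, assuming the traditional
# /var/lib/rpm database location. Printed for review, not executed here.
plan='cp -a /var/lib/rpm /var/lib/rpm.backup
rpm --rebuilddb
rpm -qa --last | head'
printf '%s\n' "$plan"
```

Backing up the database directory first means that even a failed rebuild leaves a way back to the starting state.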
Not so fast. The customer had very important data on this system and needed to clear a change window to try this, which is completely understandable. After a couple of weeks, we got the result: this operation, followed by a zypper rm of the superfluous lvm2 packages, had still not cleared up our issue.
Most customers would be getting a bit impatient by this point, but our customer understood the whole time what a tricky case this was developing into, and that our interest was the same as theirs: attempting safe operations before unsafe ones. The reason we tried these previous steps first is that they were a lot safer than what we were about to do next.
Getting into the nitty-gritty with low level rpm commands
At SUSE Support we've got plenty more rabbits in our hat, and we don't give up easily. The commands we were going to recommend to the customer were "low level" rpm commands rather than zypper commands. We would remove the superfluous packages with the rpm -e command, which we hoped would force a cleanup of the database. Then, as a last resort, we would perhaps have to force a reinstallation of the most recent lvm2 package.
Straight away I informed the customer that while we considered the procedure relatively safe, we should wait and run it during a database downtime to minimize the risk of data loss. I also instructed them to keep a backup of /etc/lvm/lvm.conf, both locally and somewhere else.
We had the customer run a dry run with the --test parameter:
rpm -vv -e --allmatches --test lvm2 &> rpm_dry_run.txt
This is completely safe because of the --test parameter: it just goes through the motions of uninstalling the lvm2 packages without actually committing the transaction. If that looked ok, we would then run something like this for each superfluous package:
rpm -e --noscripts --notriggers lvm2-2.02.180-9.54.1.x86_64
This would make sure that we removed just the named package, leaving its dependencies alone. The --noscripts and --notriggers options would also skip the package's uninstall scriptlets and triggers.
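Since one such command is needed per superfluous package, the per-package erase step can be sketched as a loop over the stale versions from the rpm -qa output above. The package list here mirrors that output, but the loop is hypothetical: it prints each command for review rather than executing it, and you would drop the echo (and run as root) to actually erase.

```shell
# Hedged sketch: iterate over the stale lvm2 builds, keeping the newest
# (2.02.188-12.3.1). The echo prefix prints each command for review
# instead of executing it; drop it to actually erase the packages.
keep="lvm2-2.02.188-12.3.1.x86_64"
stale="lvm2-2.02.180-9.54.1.x86_64
lvm2-2.02.180-9.51.1.x86_64
lvm2-2.02.180-9.44.1.x86_64
lvm2-2.02.180-9.34.8.x86_64"
for pkg in $stale; do
  echo "rpm -e --noscripts --notriggers $pkg"
done
```

Printing the commands first gives the customer a chance to sanity-check exactly which package versions will be touched before anything irreversible happens.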
The dry run looked ok. We just tweaked the command with an additional --nodeps option, so that it wouldn't fail on dependencies. I instructed the customer what to watch out for once the erase steps were done: the output of rpm -V (rpm verify) should look normal. If it was relatively normal, the reinstallation wouldn't need to be attempted.
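For readers unfamiliar with rpm verify output: rpm -V prints one line per file that deviates from what the database recorded, with a flag string in which each position reports one check (S = size, 5 = digest, T = mtime, and so on, with c marking config files). The line below is a simulated, hypothetical example of such output, not output from the customer's system; it just shows how the flag string can be read.

```shell
# Simulated (hypothetical) rpm -V output line for a changed config file.
# Flag positions: S=size, 5=digest, T=mtime differ; "." means the check passed.
line="S.5....T.  c /etc/lvm/lvm.conf"
flags=${line%% *}                               # flag string is the first field
[ "${flags#*5}" != "$flags" ] && echo "digest differs"
[ "${flags#*T}" != "$flags" ] && echo "mtime differs"
```

A modified config file like lvm.conf is usually expected and harmless; flags on binaries or libraries would be the signal that a forced reinstallation is needed after all.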
With the modified dry run, we were now confident from the output that we had given the correct commands. It was a huge relief to hear that the erase commands solved the issue. The rpm database was fixed, and the update commands worked fine after that. The rpm -V verify check showed there was no need to reinstall the lvm2 package, further minimizing the risk.
Mind you, there was never a large chance of something going wrong, provided all precautions were taken and the correct procedure was applied. As systems engineers who focus on reliability, we understand that nothing should be left to chance or ambiguity when vital data is at stake, and that even relatively safe operations with low odds of failure but potentially high impact should be tested and carefully considered, much like a surgeon following certain protocols during an operation to minimize risk.
All in all, it took less than an hour to find the problem, but we spent days of work putting together a solution that was functional, complete, tested and as safe as possible.