An LVM volume group (VG) on our Proxmox VE cluster failed to create new logical volumes, reporting that its metadata was full. At first this looked easy, “fine, I’ll just add more space for metadata”, but it quickly turned out to be an ordeal to struggle through.

root@iBug-Server:~# lvcreate -L 4M -n test-1721 test
  VG test 1723 metadata on /dev/sdc1 (521759 bytes) exceeds maximum metadata size (521472 bytes)
  Failed to write VG test.
root@iBug-Server:~# # wut?

Problems

It isn’t hard to imagine that, just like regular disks need a partition table, LVM needs its own “partition table”, called LVM metadata, to store information about PVs, VGs and LVs. It grows with the complexity of a VG, such as the number of PVs and the configuration of the LVs.

The metadata area size and free space of a PV or a VG can be inspected with pvdisplay and vgdisplay, respectively.

root@iBug-Server:~# pvdisplay -C -o name,mda_size,mda_free
  PV         PMdaSize  PMdaFree
  /dev/sdc1   1020.00k        0
root@iBug-Server:~# vgdisplay -C -o name,mda_size,mda_free
  VG   VMdaSize  VMdaFree
  test  1020.00k        0

The metadata area (whence the abbreviation “mda”) is where LVM stores volume information. The trouble comes from the fact that the LVM MDA has multiple oddities that go against intuition, which adds to the difficulty of finding a solution.

1. “Metadata” is an ambiguous term

If you just go ahead and search for “LVM metadata size”, you’ll be surprised by how irrelevant the search results are. They’re almost all about “thin pool metadata”, which is a separate LV usually named poolname_tmeta.
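
A quick way to see the distinction for yourself, assuming you have a thin pool around (the pool name pool0 below is made up): listing all LVs, including hidden ones, reveals the thin pool’s metadata LV, which has nothing to do with the PV metadata area discussed here.

lvs -a -o lv_name,lv_size    # hidden LVs such as [pool0_tmeta] hold thin pool metadata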

The correct answer is actually in the man page that should show up as the first Google result: pvcreate(8). This is where I discovered the use of pvs and vgs to get the sizes.

2. The default MDA size is fixed

Contrary to common expectations, the default MDA size is fixed and does not scale with the size of the PV or the VG. This is explained in the man page, right above where pvs -o mda_size is described.

This is not the case for LVM thin pools, however, whose metadata size does scale with the pool size by default. The design considerations behind this difference are not clear to me.
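
For comparison, here’s a rough sketch of creating a thin pool; the VG and pool names are made up. LVM computes a default metadata size from the pool size, and it can also be set explicitly:

# metadata size defaults to a value computed from the pool size;
# --poolmetadatasize overrides it
lvcreate --type thin-pool -L 100G --poolmetadatasize 1G -n pool0 somevg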

3. The size of the MDA cannot be changed after creation

Like many others probably would, I thought “fine, I’ll just expand the MDA”, and started digging through Google and the relevant man pages. Another quarter-hour went by before it became clear that the MDA size can only be set when the PV is created. This was confirmed by this Proxmox forum post.
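
For the record, the MDA size can only be chosen at pvcreate time (the device name below is a placeholder); there is no resize counterpart:

pvcreate --metadatasize 16m /dev/sdX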

4. Reducing “metadata copies” does not free up space

There’s also a pvmetadatacopies option listed in both vgchange(8) and pvchange(8), which seems tempting to try. Unfortunately, again contrary to intuition, it does not free up half of the MDA space: setting it to 1, down from the default of 2, produces no visible change.

Finding the solution

At this point I had an outline of the problem I was facing: a VG on a single PV, a fixed MDA size, and no way to free up any metadata space.

Fortunately, the shared SAN target supports “overcommitting”, meaning I can have an extra LUN with little effort. Since the storage is already slightly over 50% utilized, there isn’t enough free space to move all the data onto a new LUN, and even if there were, moving it would take an infeasible amount of time. Ideally the new LUN shouldn’t be too large either, to limit the fallout should the underlying disk group fill up.

So, how can this trouble be overcome, with the help of a new LUN?


At this level of detail, Google was no longer able to help, so I had to resort to man pages, hoping I wouldn’t have to dig through the source code.

Looking at pvchange(8), the only metadata-related property that can be changed on an existing PV is metadataignore, which instructs LVM to ignore the MDA on a PV.

A possible solution emerged: create a new PV with a large enough MDA, merge it into the VG, and disable metadata storage on the old PV.

Solution

I created a new LUN in the storage server’s dashboard and loaded it onto all servers in the cluster using iscsiadm:

iscsiadm -m session --rescan

The rescan can take a while to propagate, so I kept monitoring for a minute or so before /dev/sdd showed up on all hosts.
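
One simple way to watch for the new device on each host, using nothing but standard tools (sdd being the name I expected):

lsblk            # re-run until /dev/sdd appears
dmesg | tail     # the kernel also logs the new SCSI disk as it attaches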

Now I turn the new block device into a PV and add it to the problematic VG:

pvcreate --metadatasize 64m /dev/sdd
vgextend test /dev/sdd

Partly to my surprise, a warning popped up:

VG test 1723 metadata on /dev/sdc1 (521615 bytes) exceeds maximum metadata size (521472 bytes)
WARNING: Failed to write an MDA of VG test.
Volume group "test" successfully extended

This one isn’t hard to understand: the VG metadata must record the identifiers of all participating PVs, so adding a PV means there’s even more metadata to store.

So before pulling this off, I had to remove an LV temporarily. I had a few lying around for testing purposes, so finding one to get rid of was not hard. After that, I could repeat the vgextend command without a single warning.
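
Roughly what that looked like, with a stand-in name for one of my test LVs:

lvremove test/some-test-lv    # frees just enough room in the old MDA
vgextend test /dev/sdd        # now completes without the warning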

Next I exclude the original PV from storing metadata:

pvchange --metadataignore y /dev/sdc1
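
As a sanity check (not part of the original transcript), the per-PV MDA counts show how many metadata areas exist versus how many are actually in use:

pvs -o name,pv_mda_count,pv_mda_used_count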

Now I can add another LV inside this VG without error:

root@iBug-Server:~# lvcreate -L 1M -n test-1721 test
  Rounding up size to full physical extent 4.00 MiB
  Logical volume "test-1721" created.
root@iBug-Server:~# pvs -o name,mda_size,mda_free
  PV         PMdaSize  PMdaFree
  /dev/sdc1   1020.00k        0
  /dev/sdd     <65.00m   <32.00m

Caveats

LVM by default stores an identical copy of the metadata on every PV that belongs to the same VG. With this “solution”, the complete metadata is stored only on the newly created PV, so you certainly want to use reliable storage for it: that PV is now a single point of failure (SPOF) for the whole VG.
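
If that risk is unacceptable, one possible mitigation (a sketch, not something I did) is to add a second small PV with an equally roomy MDA so the metadata is mirrored again; the device name is hypothetical:

pvcreate --metadatasize 64m /dev/sde
vgextend test /dev/sde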

If you ever want a copy of the metadata for inspection, or to recover a failed VG (let’s hope you never need to), LVM maintains automatic backups under /etc/lvm/backup. They’re in their original text form (so easily readable) and are ready for use with vgcfgrestore.
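
Since the backups are plain text, they can be read directly, and vgcfgrestore accepts one as input; a last-resort sketch only, for a VG that is actually broken:

less /etc/lvm/backup/test
vgcfgrestore --file /etc/lvm/backup/test test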

Indeed, the recommended solution is to create a new, larger VG and migrate your data ASAP. After all, data security matters the most.
