After returning from the datacenter, I start working on the migration. The initial steps are nothing out of the ordinary:
- blkdiscard -f /dev/sdb
- mkfs.vfat /dev/sdb1, mkfs.ext4 /dev/sdb2, pvcreate /dev/sdb3
- mount /dev/sdb2 /t, rsync -aHAXx / /t
- mount /dev/sdb1 /t/boot/efi, arch-chroot /t, grub-install (target is x86_64-efi), update-grub
- vgextend pve /dev/sdb3, pvmove /dev/sda3 /dev/sdb3
At this point, a quick thought emerges: This is not the final drive to run the system on and is only here for the transitional period. A second migration is planned when the new SSD arrives. So why not take this chance and move the rootfs onto LVM as well?
With that in mind, I hit Ctrl-C on pvmove, unaware that terminating the pvmove process does not cancel the operation but merely pauses it.
For a moment, I thought I successfully canceled it and tried to re-partition the new drive.
Since the new PV is still in use by the suspended pvmove operation, the kernel would not accept any changes to /dev/sdb3.
During this process, I deleted and recreated the new rootfs (/dev/sdb2) and the new PV (/dev/sdb3) many times, and even tried manually editing LVM metadata (via vgcfgbackup pve, edit /etc/lvm/backup/pve and vgcfgrestore pve), before finally giving up and rebooting the system.
As a daily dose for a sysadmin, the server didn't boot up as expected. I fired up a browser to connect to the machine's IPMI, only to find that the remote console feature for iDRAC 9 was locked behind a paywall, for goodness' sake. Thanks to God almighty Dell, things were now unnecessarily more complicated than ever before. I carefully recalled every step taken and quickly identified the problem - one important thing had been forgotten: GRUB was successfully reinstalled on the new EFI partition (which was somehow left intact during the whole fiddling process), pointing to the now-deleted new root partition, so the boot process was now stuck at GRUB.
Fortunately, out of precaution, I had previously configured the IPMI with serial-over-LAN, so I at least still had serial access to the server via ipmitool. This saved me from a trip back to the datacenter.
ipmitool -I lanplus -H <ip> -U <user> -P <password> sol activate
And better yet, this iDRAC 9 can change BIOS settings, most notably the boot order and one-time boot override. This definitely helped the most in the absence of that goddamn remote console.
After some trial and error, I got myself into the GRUB command line, and it didn't look good:
grub rescue>
There's pretty much just the ls command, and it doesn't even recognize the EFI partition (FAT32 filesystem). With some more twiddling, I found this "rescue mode" capable of reading ext4, which shed some light on the situation.
grub rescue> set root=(hd0,gpt2)
grub rescue> ls /boot/grub
fonts grub.cfg grubenv locale unicode.pf2 x86_64-efi
Now things began to look up.
grub rescue> set prefix=/boot/grub
grub rescue> insmod normal
grub rescue> normal
In a few seconds, I was delighted to discover that the system was up and running, and continued migrating the rootfs.
After everything was done, as an extra precaution, I installed grub-efi-amd64-signed, which provides a large, monolithic grubx64.efi with all the "optional" modules built in, so it no longer relies on the filesystem for, e.g., LVM support, in case a similar disaster happens again.
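On Debian, that roughly amounts to the following (exact flags may vary with your setup, so treat this as a sketch rather than a recipe):
apt install grub-efi-amd64-signed   # pulls in the pre-built, monolithic signed image
grub-install                        # re-run so the signed image lands on the EFI partition
update-grub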
When trying to remove the faulty drive from the server, I at first misremembered its position, and we instead pulled out a running large-capacity HDD. Luckily it was not damaged, so we quickly inserted it back. Thanks to ZFS's design, it automatically triggered a resilver, which completed in just a blink.
# zpool status
pool: rpool
state: ONLINE
scan: resilvered 63.4M in 00:00:03 with 0 errors on Tue Mar 12 17:03:23 2024
If this were a hardware RAID, a tedious and time-consuming rebuild would have been inevitable. It’s only with ZFS that this rapid recovery is possible.
This incident was a good lesson for me, and some big takeaways I’d draw:
Plus, the correct way to cancel a pvmove operation is documented in man 8 pvmove, right in the 2nd paragraph of the Description section.
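For the record, the relevant commands are short (quoted from memory - do check the man page yourself):
pvmove            # with no arguments, restarts/resumes any interrupted moves
pvmove --abort    # cancels any pvmove operations in progress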
None of my Windows hosts (PCs and VMs) has their RDP port exposed to the public internet directly, and they’re all connected to my mesh VPN (which is out of scope for this blog article). My primary public internet entry gateway for the intranet runs Debian with fully manually configured iptables-based firewall, and I frequently work on it through SSH.
My goal is to expose the RDP port only to myself. There are a few obvious solutions eliminated for different reasons:
The question arises that if SSH access is sufficiently convenient, why not use it as an authentication and authorization mechanism? So I came up with this:
A pre-configured iptables rule set to allow RDP access from a specific IP set. For example:
*filter
:FORWARD DROP
-A FORWARD -d 192.0.2.1 -p tcp --dport 3389 -m set --match-set ibug src -j ACCEPT
*nat
-A RDPForward -p tcp --dport 3389 -j DNAT --to-destination 192.0.2.1:3389
-A RDPForward -p udp --dport 3389 -j DNAT --to-destination 192.0.2.1:3389
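These fragments assume the ibug IP set already exists (chain declarations and COMMIT lines are omitted for brevity); creating the set would look something like:
ipset create ibug hash:ip timeout 300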
A way to keep the client address in the set for the duration of the SSH session. I use SSH user rc file to proactively refresh it:
#!/bin/bash
# rwxr-xr-x ~/.ssh/rc
if [ -z "$BASH" ]; then
exec /bin/bash -- "$0" "$@"
exit 1
fi
_ssh_client="${SSH_CONNECTION%% *}"
_ppid="$(ps -o ppid= $(ps -o ppid= $PPID))"
nohup ~/.local/bin/_ssh_refresh_client "$_ssh_client" "$_ppid" &>/dev/null & exit 0
#!/bin/sh
# rwxr-xr-x ~/.local/bin/_ssh_refresh_client
_ssh_client="$1"
_ppid="$2"
while kill -0 "$_ppid" 2>/dev/null; do
sudo ipset -exist add ibug "$_ssh_client" timeout 300
sleep 60
done
exit 0
The idea is to refresh (ipset add
with timeout) the IPset entry as long as the SSH session remains. When SSH disconnects, the script stops refreshing and IPset will clean it up after the specified time.
To determine the presence of the associated SSH session, the script finds the PID of the "session manager process". The "parent PID" is read twice because sshd double-forks. The client address is conveniently provided in the environment variable, so putting all these together yields precisely what I need.
The only caveat is the use of sudo, as ipset requires CAP_NET_ADMIN to interact with the kernel network stack. It's certainly possible to write an SUID binary as a wrapper, but for me, configuring passwordless sudo for the ipset command is enough.
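For reference, the sudo rule can be scoped to that single command; something like this (user name and binary path are illustrative):
# /etc/sudoers.d/ipset-refresh
ibug ALL=(root) NOPASSWD: /usr/sbin/ipset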
So now whenever I need to RDP to my computer through this forwarded port on the public internet, I can just SSH into the gateway and it’ll automatically grant me 5 minutes of RDP access from this specific network. All traffic forwarding is done in the kernel with no extra encapsulation or encryption, ensuring the best possible performance for both the endpoints and the gateway router itself.
Nginx ships the limit_req module for rate-limiting requests, which does a decent job, except its documentation is not known for its conciseness, plus a few questionable design choices. I happen to have a specific need for this feature, so I examined it a bit.
As always, everything begins with the documentation. A quick-start example is given:
http {
limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
...
server {
...
location /search/ {
limit_req zone=one burst=5;
}
The basis is the limit_req_zone directive, which defines a shared memory zone for storing the rate-limiting state. Its arguments include the key, the name and size of the zone, followed by the average or sustained rate limit. The rate limit has two possible units: r/s or r/m. It also says
The limitation is done using the “leaky bucket” method.
So far so good, except the burst limit is … specified where it's used? Moving on for now.
The limit_req
directive specifies when the requests should be limited.
If the requests rate exceeds the rate configured for a zone, their processing is delayed such that requests are processed at a defined rate.
Seems pretty clear but slightly counter-intuitive: by default, burst requests are queued up and delayed until the rate falls below the limit, whereas most common rate-limiting implementations would simply serve them immediately.
I find it easier to understand this model with a queue. Each key defines a queue where items are popped at the specified rate (e.g. 1r/s). Incoming requests are added to the queue, and are only served upon exiting the queue. The queue size is defined by the burst limit, and excess requests are dropped when the queue is full.
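To make that model concrete, here's a toy simulation of my mental model (my own sketch, not how Nginx is implemented), with burst=5 at 1r/s and 10 requests arriving 10 ms apart:
import queue
import threading
import time

RATE = 1.0    # requests per second leaving the queue
BURST = 5     # queue capacity, i.e. the burst limit
q = queue.Queue(maxsize=BURST)
start = time.time()

def server():
    while True:
        req = q.get()                        # a request is only answered once it leaves the queue
        print(f"request {req} served at {time.time() - start:.2f}s")
        time.sleep(1 / RATE)                 # pop at the configured rate

threading.Thread(target=server, daemon=True).start()

for i in range(1, 11):                       # 10 requests arriving 10 ms apart
    try:
        q.put_nowait(i)
    except queue.Full:
        print(f"request {i} rejected (503)")
    time.sleep(0.01)

time.sleep(10)                               # let the queue drain before exiting
Running it reproduces the measured behavior below: the first six requests are eventually served, one per second, and the rest are rejected.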
The more common behavior, however, requires an extra option:
If delaying of excessive requests while requests are being limited is not desired, the parameter nodelay should be used:
limit_req zone=one burst=5 nodelay;
With nodelay, requests are served as soon as they enter the queue:
The next confusing option, conflicting with nodelay, is delay:
The delay parameter specifies a limit at which excessive requests become delayed. Default value is zero, i.e. all excessive requests are delayed.
After a bit of fiddling, I realized the model is now like this:
So what delay actually means is to delay requests only after this "delay limit" is reached. In other words, requests are served as soon as they arrive at the n-th position from the front of the queue.
During all this testing, I wasn't happy with the existing tools, so I wrote my own, simple as it is: GitHub Gist.
With this new tool, I can now (textually) visualize the behavior of different options. Under the burst=5
and delay=1
setup, the output is like this:
$ go run main.go -i 10ms -c 10 http://localhost/test
[1] Done [0s] [200 in 2ms]
[2] Done [10ms] [200 in 1ms]
[3] Done [21ms] [200 in 981ms]
[4] Done [31ms] [200 in 1.972s]
[5] Done [42ms] [200 in 2.962s]
[6] Done [53ms] [200 in 3.948s]
[7] Done [64ms] [503 in 0s]
[8] Done [75ms] [503 in 1ms]
[9] Done [85ms] [503 in 0s]
[10] Done [95ms] [503 in 0s]
If you try the tool yourself, the HTTP status codes are colored for even better prominence.
In the above example, the first request is served immediately as it also exits the queue immediately. The second request is queued at the front, and because delay=1
, it’s also served immediately. Subsequent requests are queued up until the sixth when the queue becomes full. The seventh and thereafter are dropped.
If we change delay=0
, the output becomes:
$ go run main.go -i 10ms -c 10 http://localhost/test
[1] Done [0s] [200 in 2ms]
[2] Done [10ms] [200 in 993ms]
[3] Done [21ms] [200 in 1.982s]
[4] Done [32ms] [200 in 2.973s]
[5] Done [43ms] [200 in 3.959s]
[6] Done [54ms] [200 in 4.949s]
[7] Done [65ms] [503 in 1ms]
[8] Done [75ms] [503 in 1ms]
[9] Done [85ms] [503 in 2ms]
[10] Done [96ms] [503 in 1ms]
Still only the first 6 requests are served, but the 2nd to the 6th are delayed by an additional second due to the removal of delay=1
.
Under this model, the nodelay
option can be understood as delay=infinity
, while still respecting the burst
limit.
Why is the burst limit specified at use time, instead of at zone definition? Only experiments can find out:
location /a {
limit_req zone=test burst=1;
}
location /b {
limit_req zone=test burst=5;
}
Then I fire up two simultaneous batches of 10 requests each to /a
and /b
respectively:
$ go run main.go -i 10ms -c 10 http://localhost/a
[1] Done [0s] [200 in 2ms]
[2] Done [10ms] [200 in 992ms]
[3] Done [21ms] [503 in 0s]
[4] Done [32ms] [503 in 0s]
[5] Done [42ms] [503 in 0s]
[6] Done [53ms] [503 in 0s]
[7] Done [63ms] [503 in 0s]
[8] Done [73ms] [503 in 0s]
[9] Done [83ms] [503 in 0s]
[10] Done [94ms] [503 in 0s]
$ go run main.go -i 10ms -c 10 http://localhost/b
[1] Done [0s] [200 in 1.862s]
[2] Done [11ms] [200 in 2.852s]
[3] Done [21ms] [200 in 3.842s]
[4] Done [32ms] [200 in 4.832s]
[5] Done [43ms] [503 in 1ms]
[6] Done [54ms] [503 in 0s]
[7] Done [64ms] [503 in 0s]
[8] Done [75ms] [503 in 1ms]
[9] Done [85ms] [503 in 0s]
[10] Done [95ms] [503 in 1ms]
As can be seen from the output, the batch to /a
is served as usual, but the batch to /b
is significantly delayed, and two fewer requests are served.
If I reverse the order of sending the batches, the result is different again:
$ go run main.go -i 10ms -c 10 http://localhost/b
[1] Done [0s] [200 in 2ms]
[2] Done [10ms] [200 in 993ms]
[3] Done [20ms] [200 in 1.982s]
[4] Done [31ms] [200 in 2.974s]
[5] Done [42ms] [200 in 3.963s]
[6] Done [52ms] [200 in 4.955s]
[7] Done [63ms] [503 in 0s]
[8] Done [74ms] [503 in 0s]
[9] Done [84ms] [503 in 0s]
[10] Done [95ms] [503 in 0s]
$ go run main.go -i 10ms -c 10 http://localhost/a
[1] Done [0s] [503 in 1ms]
[2] Done [10ms] [503 in 1ms]
[3] Done [20ms] [503 in 0s]
[4] Done [31ms] [503 in 0s]
[5] Done [42ms] [503 in 0s]
[6] Done [52ms] [503 in 0s]
[7] Done [63ms] [503 in 1ms]
[8] Done [73ms] [503 in 0s]
[9] Done [83ms] [503 in 0s]
[10] Done [93ms] [503 in 0s]
This time the batch to /b
is served as usual, but the entire batch to /a
is rejected.
I am now convinced that the queue itself is shared between /a
and /b
, and each limit_req
directive decides for itself whether and when to serve the requests. So when /a
is served first, the queue holds one burst request, and /b
fills the queue up to 5 requests. When /b
is served first, the queue is already holding 5 requests and leaves no room for /a
. Similarly, with the delay
option, each limit_req
directive can still decide when the request is ready to serve.
This is probably not the most straightforward design, and I can’t come up with a use case for this behavior. But at least now I understand how it works.
I originally wanted to set up a 403 page for banned clients, and wanted to limit the rate of log writing in case of an influx of requests. The limit_req module does provide a $limit_req_status
variable which appears to be useful. This is what I ended up with:
limit_req_zone $binary_remote_addr zone=403:64k rate=1r/s;
map $limit_req_status $loggable_403 {
default 0;
PASSED 1;
DELAYED 1;
DELAYED_DRY_RUN 1;
}
server {
access_log /var/log/nginx/403/access.log main if=$loggable_403;
error_log /var/log/nginx/403/error.log warn;
error_page 403 /403.html;
error_page 404 =403 /403.html;
limit_req zone=403;
limit_req_status 403;
limit_req_log_level info;
location / {
return 403;
}
location = /403.html {
internal;
root /srv/nginx;
sub_filter "%remote_addr%" "$remote_addr";
sub_filter_once off;
}
}
With this setup, excessive requests are rejected by limit_req
with a 403 status. Only 1r/s
passes through the rate limiting, which will carry the PASSED
status and be logged, albeit still seeing the 403 page from the return 403
rule. This does exactly what I want, so time to call it a day.
I choose the CaiYun Weather (彩云天气) API for my previous experience with it, as well as for its unlimited free tier. I must admit that I initially came up with this idea after seeing the JSON API datasource plugin for Grafana, which reminded me that CaiYun's JSON API would be a perfect fit.
Configuring the datasource seems easy at first: just insert the URL and configure HTTP headers as needed. Since CY's API puts the API key in the URL path, there are no headers to configure, so I can just put in a single URL and save it.
https://api.caiyunapp.com/v2.5/TAkhjf8d1nlSlspN/121.6544,25.1552/hourly.json
I choose the hourly API so I can have forecasts for the upcoming 48 hours.
So far this is a readily available datasource that I can query. But after reviewing the JSON query editor, I decided to chop off the last segments of the URL and leave just the part up to the API key:
https://api.caiyunapp.com/v2.5/TAkhjf8d1nlSlspN/
The point here is, the query editor allows specifying an extra Path, which appears to be concatenated with this URL in the datasource configuration. Notably, I can then put the coordinates in a variable, use it in the query, and build a single dashboard for many cities.
Now that I have the query format planned, I can add a dashboard variable for selecting cities.
First things first, since I’m going to use the same datasource for all panels, I first add a variable for the datasource and restrict it to “CaiYun Weather”:
Then I add a variable $location
for the city name, and provide it with a list of cities I want to show. The variable type would be “Custom” since this is just a human-maintained list. There certainly are better ways like using a relational database or an external API, making it easier to update, but for now I’d like to keep it simple.
Beijing : 116.4074\,39.9042,Shanghai : 121.4691\,31.2243,Guangzhou : 113.2644\,23.1291,Shenzhen : 114.0596\,22.5429
First and foremost, the most intuitive metric to show is temperature. I add a time series panel and configure it to graph the temperature. Start by building the query:
- Datasource: ${datasource}
- Path: /${location}/hourly.json
- Field: $.result.hourly.temperature[*].value, Type: Number, Alias: ${location:text}
- Field: $.result.hourly.temperature[*].datetime, Type: Time
I stumbled on getting the time series to display correctly. It wasn’t anywhere obvious in the documentation for the JSON API plugin, but a series with Type = Time is required. Fortunately, CY’s API returns the time in ISO 8601 format in the datetime
field, so I can feed it directly to Grafana.
So far so good, except Grafana shows “No data”. I realized Grafana is trying to show past data, but apparently a weather forecast provides future data. I need to change the time range to “now” and “now + 48h”. Ideally, this time range is fixed and not affected by the time range selector, since after all it’s limited by the API.
This is another place where I spent half an hour on Google. The answer is "Relative time" in "Query options". Its format, however, is again unintuitive. While community posts show 1d for "last 1 day" and the official docs give several examples on using now, none of them told me how to indicate "next 48 hours". The answer is just +48h or +2d. Notably, entering now+48h would result in an error.
To make the graph look nicer, I set the unit to "°C", limit decimals to 1, set the Y-axis range to 0-40, and add a series of thresholds with colors to indicate the temperature range. Also worth mentioning is making the graph change its color according to the temperature, so I set "Graph style → Gradient mode" to "Scheme" and "Standard options → Color scheme" to "From thresholds (by value)".
Now this panel looks stunning.
CY’s API offers a variety of weather data, so with little effort I can add more panels for humidity, precipitation and more, by duplicating the temperature panel and changing the query. I also need to change the unit and thresholds accordingly but that goes without saying.
There’s also a small piece worth displaying: A description
text. It’s easy to put it in a “Stat” panel and display as “String” (instead of “Number”). And better yet, CY provides two descriptions: One for the next two hours, and one for the next two days. Two panels for two pieces of text, yeah.
One last thing I decided to leave out for now: The skycon
field that describes the weather condition, like “CLEAR_DAY” or “RAIN”. It’d be comparably easy to add a panel for it, using “Value mapping” to change the text to something more human-readable, but I’m not at the high mood for it right now, so maybe I’ll pick it up later.
One last thing I decided to leave out for now: the skycon field that describes the weather condition, like "CLEAR_DAY" or "RAIN". It'd be comparably easy to add a panel for it, using "Value mapping" to change the text to something more human-readable, but I'm not in the mood for it right now, so maybe I'll pick it up later.
Now I have a nice dashboard for viewing the weather forecast for multiple cities:
If you'd like to try it yourself, I've published the dashboard on Grafana.com: Weather Forecast. Just add the same datasource with your API key, and you can import my dashboard and start getting weather forecasts for yourself.
We'll begin with a slide from a ZFS talk from Lustre1 (page 5):
The first thing to understand is that there are at least two levels of "block" concepts in ZFS. There are "logical blocks" on an upper layer (DMU), and "physical blocks" on a lower layer (vdev). The latter is easier to understand and is almost synonymous with "disk sectors". It's precisely the ashift parameter in the zpool create command and usually matches the physical sector size of your disks (4 KiB for modern disks). Once set, ashift is immutable and can only be changed by recreating the entire vdev array (fortunately not the entire pool2). The "logical block", however, is slightly more complicated, and beyond the expressibility of a few words. In short, it's the smallest meaningful unit of data that ZFS can operate on, including reading, writing, checksumming, compression and deduplication.
You’ve probably seen recordsize
being talked about extensively in ZFS tuning guides3, which is already a great source of confusion. The default recordsize
is 128 KiB, which controls the maximum size of a logical block. The actual block size depends on the file you’re writing:
- If the file is smaller than recordsize, it's stored as a single logical block of its size, rounded up to the nearest multiple of 512 bytes.
- If the file is larger than recordsize, it's split into multiple logical blocks of recordsize each, with the last block being zero-padded to recordsize.
As with other filesystems, if you change a small portion of a large file, only 128 KiB (or whatever your recordsize
is) is rewritten, along with new metadata and checksums. Large recordsize
bloats the read/write amplification for random I/O workloads, while small recordsize
increases the fragmentation and metadata overhead for large files. Note that ZFS always validates checksums, so every read operation is done on an entire block, even if only a few bytes are requested. So it is important to align your recordsize
with your workload, like using 16 KiB for (most) databases and 1 MiB for media files. The default 128 KiB is a good compromise for general-purpose workloads, and there certainly isn’t a one-size-fits-all solution. Also note that while recordsize
can be changed on the fly, it only affects newly written data, and existing ones stay intact.
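Tuning it is just a per-dataset property set; the dataset names here are illustrative:
# zfs set recordsize=16K tank/db        # database files
# zfs set recordsize=1M  tank/media     # large media files
# zfs get recordsize tank/db tank/media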
For ZVOLs, as you’d imagine, the rule is much simpler: Every block of volblocksize
is a logical block, and it’s aligned to its own size. Since ZFS 2.2, the default volblocksize
is 16 KiB, providing a good balance between performance and compatibility.
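Unlike recordsize, volblocksize is fixed at creation time, for example (names illustrative):
# zfs create -V 32G -o volblocksize=16K tank/vm-disk0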
Compression is applied on a per-block basis, and compressed data is not shared between blocks. This is best shown with an example:
$ zfs get compression tank/test
NAME PROPERTY VALUE SOURCE
tank/test compression zstd inherited from tank
$ head -c 131072 /dev/urandom > 128k
$ cat 128k 128k 128k 128k 128k 128k 128k 128k > 1m
$ du -sh 128k 1m
129K 128k
1.1M 1m
$ head -c 16384 /dev/urandom > 16k
$ cat 16k 16k 16k 16k 16k 16k 16k 16k > 128k1
$ cat 128k1 128k1 128k1 128k1 128k1 128k1 128k1 128k1 > 1m1
$ du -sh 16k 128k1 1m1
17K 16k
21K 128k1
169K 1m1
As you can see from du
’s output above, despite containing 8 identical copies of the same 128 KiB random data, the 1 MiB file gains precisely nothing from compression, as each 128 KiB block is compressed individually. The other test of combining 8 copies of 16 KiB random data into one 128 KiB file shows positive results, as the 128 KiB file is only 21 KiB in size. Similarly, the 1 MiB file that contains 64 exact copies of the same 16 KiB chunk is exactly 8 times the size of that 128 KiB file, because the chunk data is not shared across 128 KiB boundaries.
This brings up an interesting point: it's beneficial to turn on compression even for filesystems with incompressible data4. One direct impact is on the last block of a large file, where its zero-filled bytes up to recordsize compress very well. Using LZ4 or ZSTD, compression should have negligible impact on any reasonably modern CPU and reasonably sized disks.
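Enabling it is a one-liner, and it only applies to newly written blocks:
# zfs set compression=zstd tank      # or lz4
# zfs get compressratio tank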
There are two more noteworthy points about compression, both from man zfsprops.7
:
When any setting except off is selected, compression will explicitly check for blocks consisting of only zeroes (the NUL byte). When a zero-filled block is detected, it is stored as a hole and not compressed using the indicated compression algorithm.
Instead of compressing entire blocks of zeroes like the last block of a large file, ZFS will not store anything about these zero blocks. Technically, this is done by omitting the corresponding ranges from the file’s indirect blocks4.
Take this test for example: I created a file with 64 KiB of urandom, then 256 KiB of zeroes, then another 64 KiB of urandom. The file is 384 KiB in size, but only 128 KiB is actually stored on disk:
# zfs create pool0/srv/test
# cat <(head -c 64K /dev/urandom) <(head -c 256K /dev/zero) <(head -c 64K /dev/urandom) > /srv/test/test
# du -sh /srv/test/test
145K /srv/test/test
We can also examine the file’s indirect blocks with zdb
:
# ls -li /srv/test/test
2 -rw-r--r-- 1 root root 393216 Oct 30 02:05 /srv/test/test
# zdb -ddddd pool0/srv/test 2
[...]
Indirect blocks:
0 L1 0:1791b7d3000:1000 20000L/1000P F=2 B=9769680/9769680 cksum=[...]
0 L0 0:1791b7b1000:11000 20000L/11000P F=1 B=9769680/9769680 cksum=[...]
40000 L0 0:1791b7c2000:11000 20000L/11000P F=1 B=9769680/9769680 cksum=[...]
segment [0000000000000000, 0000000000020000) size 128K
segment [0000000000040000, 0000000000060000) size 128K
Here we can see only two L0 blocks allocated, each being 20000 (hex, dec = 131072) bytes logical and 11000 (hex, dec = 69632) bytes physical in size. The two L0 blocks match the two segments shown at the bottom, with the middle segment nowhere to be found.
Any block being compressed must be no larger than 7/8 of its original size after compression, otherwise the compression will not be considered worthwhile and the block saved uncompressed. […] for example, 8 KiB blocks on disks with 4 KiB disk sectors must compress to 1/2 or less of their original size.
This one should be self-explanatory.
Up until now we’ve only talked about logical blocks, which are all on the higher layers of the ZFS hierarchy. RAIDZ is where physical blocks (disk sectors) really come into play and adds another field of confusion.
Unlike traditional RAID 5/6/7(?) that combine disks into an array and presents a single volume for the filesystem, RAIDZ handles each logical block separately. I’ll cite this illustration from Delphix5 to explain:
This example shows a 5-wide RAID-Z1 setup.
Multi-sector blocks are striped across disks, with parity sectors inserted every 4 sectors, matching the data-to-parity ratio of the vdev array.
This design allows RAID to play well with ZFS’s log-structured design and avoids the need for read-modify-write cycles. Consequently, the RAID overhead is now dependent on your data and is no longer an intrinsic property of the RAID level and array width. The same Delphix article shares a nice spreadsheet6 that calculates RAID overhead for you:
Accounting for the storage space of a RAIDZ array is as problematic as it seems: there's no way to calculate the available space in advance without knowledge of the block size pattern.
ZFS works around this by showing an estimate, assuming all data were stored as 128 KiB blocks7. On my test setup with five 16 GiB disks in RAID-Z1 and ashift=12
, the available space shows as 61.5G, while zpool
shows the raw size as 79.5G:
# zpool create -o ashift=12 test raidz1 nvme3n1p{1,2,3,4,5}
# zfs list test
NAME USED AVAIL REFER MOUNTPOINT
test 614K 61.5G 153K /test
# zpool list test
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
test 79.5G 768K 79.5G - - 0% 0% 1.00x ONLINE -
When I increase ashift
to 15 (32 KiB sectors), the available space drops quite a bit, even if zpool
shows the same raw size:
# zpool create -o ashift=15 test raidz1 nvme3n1p{1,2,3,4,5}
# zfs list test
NAME USED AVAIL REFER MOUNTPOINT
test 4.00M 51.3G 1023K /test
# zpool list test
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
test 79.5G 7.31M 79.5G - - 0% 0% 1.00x ONLINE -
In both cases, calculating the “raw” disk space from the available space gives roughly congruent results:
The default refreservation
for non-sparse ZVOLs exhibits a similar behavior:
# zfs create -V 4G -o volblocksize=8K test/v8k
# zfs create -V 4G -o volblocksize=16K test/v16k
# zfs get refreservation test/v8k test/v16k
NAME PROPERTY VALUE SOURCE
test/v16k refreservation 4.86G local
test/v8k refreservation 6.53G local
Interestingly, neither of the refreservation sizes matches the RAID overhead as calculated in the Delphix spreadsheet6, as you would expect some 6.0 GiB for the 16k-volblocksized ZVOL and some 8.0 GiB for the 8k-volblocksized one. Let's not forget that the whole accounting system assumed 128 KiB blocks and scaled by that8. So the actual meaning of 4.86G and 6.53G would be "the equivalent space if volblocksize had been 128 KiB". If we multiply both values by 1.25 (the overhead for 128 KiB blocks on 5-wide RAIDZ), we get 6.08 GiB and 8.16 GiB of raw disk space respectively, both of which match the expected values more closely. The final minor difference is due to the different amount of metadata required for different numbers of blocks.
I never imagined I would delve this deep into ZFS when I first stumbled upon the question. There are lots of good write-ups on individual components of ZFS all around the web, Chris Siebenmann's blog in particular. But few combine all the pieces together and paint the whole picture, so I had to spend some time synthesizing them by myself. As you've seen in the Lustre slide, ZFS is so complex a beast that it's hard to digest in its entirety. So for now I have no idea how much effort I will put into learning it, nor what future blogs I might write. But anyways, that's one large mystery solved, for myself and my readers (you), and time to call it a day.
Andreas Dilger (2010) ZFS Features & Concepts TOI ↩
Jim Salter (2020) ZFS 101 – Understanding ZFS storage and performance ↩
OpenZFS Workload Tuning ↩
Chris Siebenmann (2017) ZFS’s recordsize, holes in files, and partial blocks ↩ ↩2
Matthew Ahrens (2014) How I Learned to Stop Worrying and Love RAIDZ ↩
RAID-Z parity cost (Google Sheets) ↩ ↩2
openzfs/zfs#4599 (2016) disk usage wrong when using larger recordsize, raidz and ashift=12 ↩
Mike Gerdts (2019) (Code comment in libzfs_dataset.c) ↩
When I reviewed my edited Nginx configuration and tried visiting the new website, I received a 504 Gateway Timeout error after curl
hung for a minute. Knowing that the web server had yet to be set up, I was expecting a 502 Bad Gateway error. I quickly recalled the conditions for Nginx to return these specific errors: 502 if the upstream server is immediately known down, and 504 if the upstream server is up but not responding.
Since the actual web application hadn’t been set up yet, the new VM should have nothing listening on the configured ports. Consequently, the kernel should immediately respond with a TCP Reset for any incoming connections. To verify this, I ran tcpdump
on both sides to check if the TCP reset packets actually came out. To my surprise, the packets were indeed sent out from the new VM, but the gateway server received nothing. So there was certainly something wrong with the firewall. I took a glance at the output of pve-firewall compile
. They were very structured and adequately easy to understand, but I couldn’t immediately identify anything wrong. Things were apparently more complicated than I had previously anticipated.
As usual, the first thing to try is Googling. Searching for pve firewall tcp reset
brought this post on Proxmox Forum as the first result. Their symptoms were precisely the same as mine:
- Assume we have a service running on TCP port 12354
- Clients can communicate with it while running
- While service is down, clients recieved “Connection timed out” (no answer) even if OS send TCP RST packets:
[…]
However, these RST packets are dropped somewhere in PVE firewall.
On the VM options :
- Firewall > Options > Firewall = No, Has no effect
- Firewall > Options > * Policy = ACCEPT, Has no effect (even with NO rule in active for this VM)
- Hardware > Network Device >
firewall=0
, allows packets RST to pass!
I gave the last suggestion a try, and it worked! I could now see connections immediately reset on the gateway server, and Nginx started producing 502 errors. But I was still confused why this happened in the first place. The first thread contained nothing else useful, so I continued scanning through other search results and noticed another post about another seemingly unrelated problem, with a plausible solution:
[…], and the fix was just to add the
nf_conntrack_allow_invalid: 1
in thehost.fw
for each node - I didn’t have to do anything other than that.
That seemed understandable to me, so I gave it a try as well, and to my pleasure, it also worked.
Regrettably, useful information ceased to exist online beyond this, and it was far from painting the whole picture. So anything further would have to be uncovered on my own.
I reviewed the two helpful workarounds and made myself abundantly clear about their effects:
Disabling the firewall on the virtual network device stops PVE from bridging the interface an extra time, as shown in the following diagram:
Adding nf_conntrack_allow_invalid: 1
removes one single iptables rule:
-A PVEFW-FORWARD -m conntrack --ctstate INVALID -j DROP
I couldn’t figure out how the first difference was relevant, but the second one provided an important clue: The firewall was dropping TCP Reset packets because conntrack considered them invalid.
Conntrack (connection tracking) is a Linux kernel subsystem that tracks network connections and aids in stateful packet inspection and network address translation. The first packet of a connection is considered “NEW”, and subsequent packets from the same connection are considered “ESTABLISHED”, including the TCP Reset packet when it’s first seen, which causes conntrack to delete the connection entry.
There was still nothing obvious yet, so time to start debugging.
I ran tcpdump -ni any host 172.31.0.2 and host 172.31.1.11 and tcp
on the PVE host to capture packets between the two VMs. This is what I got (output trimmed):
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:33:11.911184 veth101i1 P IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911202 fwln101i1 Out IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911203 fwpr101p1 P IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911206 fwpr811p0 Out IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911207 fwln811i0 P IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911213 tap811i0 Out IP 172.31.0.2.50198 > 172.31.1.11.80: Flags [S], seq 3404503761, win 64240
16:33:11.911262 tap811i0 P IP 172.31.1.11.80 > 172.31.0.2.50198: Flags [R.], seq 0, ack 3404503762, win 0, length 0
16:33:11.911267 fwln811i0 Out IP 172.31.1.11.80 > 172.31.0.2.50198: Flags [R.], seq 0, ack 1, win 0, length 0
16:33:11.911269 fwpr811p0 P IP 172.31.1.11.80 > 172.31.0.2.50198: Flags [R.], seq 0, ack 1, win 0, length 0
^C
9 packets captured
178 packets received by filter
0 packets dropped by kernel
The first thing to notice is the ACK number. After coming from tap811i0, it suddenly became 1 for no apparent reason. I struggled with this for a good while and temporarily put it aside.
Adding nf_conntrack_allow_invalid: 1
to the firewall options and capturing packets again, I got the following:
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:46:15.243002 veth101i1 P IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243015 fwln101i1 Out IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243016 fwpr101p1 P IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243020 fwpr811p0 Out IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243021 fwln811i0 P IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243027 tap811i0 Out IP 172.31.0.2.58784 > 172.31.1.11.80: Flags [S], seq 301948896, win 64240
16:46:15.243076 tap811i0 P IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 301948897, win 0, length 0
16:46:15.243081 fwln811i0 Out IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 1, win 0, length 0
16:46:15.243083 fwpr811p0 P IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 1, win 0, length 0
16:46:15.243086 fwpr101p1 Out IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 1, win 0, length 0
16:46:15.243087 fwln101i1 P IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 1, win 0, length 0
16:46:15.243090 veth101i1 Out IP 172.31.1.11.80 > 172.31.0.2.58784: Flags [R.], seq 0, ack 1, win 0, length 0
^C
This time, while the ACK number was still wrong, the RST packet somehow got through. Ignoring the ACK numbers for now, the first capture suggested that the RST packet was dropped between fwpr811p0 P and fwpr101p1 Out. That was the main bridge vmbr0. All right then, that was where the PVEFW-FORWARD chain kicked in, so at that point the RST packet was --ctstate INVALID. Everything was logical so far.
So how about disabling firewall for the interface on VM 811?
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
17:19:01.812030 veth101i1 P IP 172.31.0.2.39734 > 172.31.1.11.80: Flags [S], seq 1128018611, win 64240
17:19:01.812045 fwln101i1 Out IP 172.31.0.2.39734 > 172.31.1.11.80: Flags [S], seq 1128018611, win 64240
17:19:01.812046 fwpr101p1 P IP 172.31.0.2.39734 > 172.31.1.11.80: Flags [S], seq 1128018611, win 64240
17:19:01.812051 tap811i0 Out IP 172.31.0.2.39734 > 172.31.1.11.80: Flags [S], seq 1128018611, win 64240
17:19:01.812178 tap811i0 P IP 172.31.1.11.80 > 172.31.0.2.39734: Flags [R.], seq 0, ack 1128018612, win 0, length 0
17:19:01.812183 fwpr101p1 Out IP 172.31.1.11.80 > 172.31.0.2.39734: Flags [R.], seq 0, ack 1, win 0, length 0
17:19:01.812185 fwln101i1 P IP 172.31.1.11.80 > 172.31.0.2.39734: Flags [R.], seq 0, ack 1, win 0, length 0
17:19:01.812190 veth101i1 Out IP 172.31.1.11.80 > 172.31.0.2.39734: Flags [R.], seq 0, ack 1, win 0, length 0
^C
This time fwbr811i0
was missing, and the RST packet didn’t get dropped at vmbr0
. I was left totally confused.
I decided to sort out the ACK number issue, but ended up asking my friends for help. It turned out this was well documented in tcpdump(8)
:
-S
--absolute-tcp-sequence-numbers
Print absolute, rather than relative, TCP sequence numbers.
This certainly came as a surprise, but at least I was assured there was nothing wrong with the ACK numbers.
Up to now, that's one more step forward, and a small conclusion: by the time the RST packet reached vmbr0, it was already --ctstate INVALID.
. I was still missing something.
Time to investigate conntrack.
conntrack
is the tool to inspect and modify conntrack entries. I ran conntrack -L
to list all entries, only to realize it’s inefficient. So instead, I ran conntrack -E
to watch for “events” in real time, so that I could compare the output with tcpdump
. Except that the entire connection concluded so quickly that I couldn’t identify anything.
I had to add artificial delays to the packets to clearly separate each hop that the RST packet goes through:
tc qdisc add dev tap811i0 root netem delay 200ms
tc qdisc add dev fwln811i0 root netem delay 200ms
I also tuned the output on both sides to show the timestamp in a consistent format. For conntrack, -o timestamp
produced Unix timestamps (which is the only supported format), so for tcpdump
I also resorted to -tt
to show Unix timestamps as well.
conntrack -E -o timestamp -s 172.31.0.2 -d 172.31.1.11
tcpdump -ttSni any host 172.31.0.2 and host 172.31.1.11 and tcp
Now I could watch the outputs on two separate tmux panes. The problem immediately emerged (blank lines added for readability):
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
1696412047.886575 veth101i1 P IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412047.886592 fwln101i1 Out IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412047.886594 fwpr101p1 P IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412047.886599 fwpr811p0 Out IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412047.886600 fwln811i0 P IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412048.086620 tap811i0 Out IP 172.31.0.2.47066 > 172.31.1.11.80: Flags [S]
1696412048.086841 tap811i0 P IP 172.31.1.11.80 > 172.31.0.2.47066: Flags [R.]
1696412048.286919 fwln811i0 Out IP 172.31.1.11.80 > 172.31.0.2.47066: Flags [R.]
1696412048.286930 fwpr811p0 P IP 172.31.1.11.80 > 172.31.0.2.47066: Flags [R.]
^C
[1696412047.886657] [NEW] tcp 6 120 SYN_SENT src=172.31.0.2 dst=172.31.1.11 sport=47066 dport=80 [UNREPLIED] src=172.31.1.11 dst=172.31.0.2 sport=80 dport=47066
[1696412048.086899] [DESTROY] tcp 6 119 CLOSE src=172.31.0.2 dst=172.31.1.11 sport=47066 dport=80 [UNREPLIED] src=172.31.1.11 dst=172.31.0.2 sport=80 dport=47066
The artificial delays and the timestamps were absolutely useful: It was clear that the corresponding conntrack connection was destroyed as soon as the RST packet passed through fwbr811i0
, before it came out via fwln811i0
. When it reached vmbr0
, the connection was already gone, and the RST packet was considered invalid.
It also became explainable how firewall=0
on the virtual network device remedied the issue: It removed an extra bridge fwbr811i0
, so the connection stayed alive when the RST packet reached vmbr0
, at which point a previous rule for --ctstate ESTABLISHED
gave an ACCEPT
verdict. While it was still INVALID
when passing through fwbr101i1
, there was no rule concerning --ctstate
at play, so it slipped through this stage with no problem.
After double-checking the intention of the extra fwbr*
bridge, I drew the conclusion that this must be a bug with PVE Firewall. I reported it on the Proxmox VE bug tracker as #4983, and soon received a reply:
Thank you for the detailed write-up!
This is a known limitation for our kind of firewall setup, since the conntrack is shared between all interfaces on the host.
[…]
If you know of any other way to avoid this happening, other than using conntrack zones, I’d be happy to take a look.
So they admitted that this was a limitation but without a satisfactory solution. Guess I’m still on my own, though.
The actual problem is, when passing through fwbr811i0
, the RST packet isn’t supposed to be processed by conntrack by then. There is no sysctl
option to disable conntrack on a specific interface (or even just all bridges altogether), but at the right time the rarely-used raw
table came to my mind. It didn’t take long to work this out:
iptables -t raw -A PREROUTING -i fwbr+ -j CT --notrack
After verifying this is the intended solution, I added it as a reply to the bug report. At the time of writing this blog post, the bug report is still open, but I’m sure it’s to be resolved soon.
Debugging Linux networking has always been a pain for its lack of proper tools and its complexity. Most of the time, even reading and understanding packet captures requires immense knowledge of the protocols and all the involved components, as well as scrutinizing every single detail available. Sometimes it's even necessary to think outside the box, but fortunately not today.
Also worth mentioning is that it’s easy to suspect the fault of another piece of software, but detailed investigation is always necessary to actually lay the blame.
Just as a late reminder, useful bug reports always require detailed information and solid evidence. Glad I was able to have them at hand this time.
In previous times, OpenVPN was the general preference for personal VPN services. Since the emergence of WireGuard, however, popularity has shifted significantly thanks to its simplicity and performance. A challenge presents itself as there's only one UDP port numbered 53, making it seemingly impossible to run both OpenVPN and WireGuard on the same port.
The solution hinges on a little bit of insight.
In a similar situation, many local proxy software like Shadowsocks and V2ray support a feature called “mixed mode”, which accepts both HTTP and SOCKS5 connections on the same TCP port. This also seems impossible at first glance, but with a bit of knowledge in both protocols, it’s actually easy to pull it off.
- An HTTP proxy request starts with a plain-text request method, such as GET or CONNECT,
- while a SOCKS request starts with a single version byte: 0x04 for SOCKS4 or 0x05 for SOCKS5.
Now there's a clear line between the two protocols, and we can identify them by looking at the first byte of the request. This is how most proxy implementations work, like 3proxy and glider.
So the question is, is there a similar trait between OpenVPN and WireGuard? The answer is, as you would expect, yes.
WireGuard runs over UDP and defines 4 packet types: 3 for handshake and 1 for data. All 4 packet types share the same 4-byte header:
struct message_header {
u8 type;
u8 reserved_zero[3];
}
Similarly, all OpenVPN packet types share the same 1-byte header:
struct header_byte {
uint8_t opcode : 5;
uint8_t key_id : 3;
}
It’s worth noting that 0 is not a defined opcode, so the smallest valid value for this byte is 8, as key_id
can be anything from 0 to 7.
Now that we have the packet format for both protocols understood, we can implement a classifier that filters traffic in one protocol from the other.
Considering that the WireGuard packet format is much simpler than that of OpenVPN, I choose to identify WireGuard. With kernel firewall iptables
, options are abundant, though I find u32
the easiest:
*nat
:iBugVPN - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -p udp --dport 53 -j iBugVPN
-A iBugVPN -m u32 --u32 "25 & 0xFF = 1:4 && 28 & 0xFFFFFF = 0" -j REDIRECT --to-port 51820
-A iBugVPN -j REDIRECT --to-port 1194
COMMIT
With both OpenVPN and WireGuard running on their standard ports, this will redirect each protocol to its respective service port. While these rules only operate on the initial packet, Linux conntrack will handle the rest of the connection.
The u32
match is explained:
- The general form is <offset> [operators...] = <range>, where <offset> is relative to the IP header. For UDP over IPv4, the application payload starts from offset 28 (20 bytes of IPv4 header plus 8 bytes of UDP header).
- 25 & 0xFF = 1:4: the 28th byte is in the range 1:4.
- 28 & 0xFFFFFF = 0: the 29th to 31st bytes are all zero.
For IPv6, you just need to increase the offsets by 20 (the IPv6 header is 40 bytes), so the match becomes 45 & 0xFF = 1:4 && 48 & 0xFFFFFF = 0.
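Assuming the same chain layout, an untested sketch of the ip6tables counterpart could be:
*nat
:iBugVPN6 - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -p udp --dport 53 -j iBugVPN6
-A iBugVPN6 -m u32 --u32 "45 & 0xFF = 1:4 && 48 & 0xFFFFFF = 0" -j REDIRECT --to-port 51820
-A iBugVPN6 -j REDIRECT --to-port 1194
COMMIT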
This VPN server is running like a hearse so proofs are left out for brevity.
iBug @ USTC
August 19, 2023
Nanjing University
The environment-setup problem for computer science lab assignments:
Could we solve this by providing virtual machines with the lab environment pre-configured?
-m conntrack --ctstate NEW -j NFLOG
"Directives" such as add_file, add_package and run make up an Lxcfile DSL.
No ports are exposed to the outside; every protocol has to be forwarded.
VNC, SSH, and what?
ssh -i vm-114514.pem ubuntu@vlab.ustc.edu.cn
golang.org/x/crypto/ssh
ssh recovery@vlab.ustc.edu.cn
pct enter <vmid>
ssh console@vlab.ustc.edu.cn
pct console <vmid>
ssh serial@vlab.ustc.edu.cn
qm serial <vmid>
co_await, ...
vlab.ustc.edu.cn (standard port 5900/tcp)
PB17000001:114514 (username + VM ID, in case a user has multiple VMs)
loadbalanceinfo
/opt/vlab
VG test 1723 metadata on /dev/sdc1 (521759 bytes) exceeds maximum metadata size (521472 bytes)
Failed to write VG test.
I/O wait (%wa) blows up right at midnight.
Stopped man-db.timer and apt-daily-upgrade.timer for the users, and added RandomizedDelaySec=3h to logrotate.timer.
(nofailback)
LD_PRELOAD += libudev.so.1
LD_PRELOAD += libdbus-glib-1.so.2
Link to this page: ibug.io/p/59
This article will be a remix of the original blog, with some of my own experiences blended in.
As a courtesy, here’s the disclaimer from the original blog:
警告:下面的設定不應該被應用於有重大價值的伺服器上面!這只是筆者強行在便宜硬件上塞進PVE並以更暴力的方式去為其續命的手段。
WARNING: The following settings should not be applied to valuable production servers! This is just a method for the author to force Proxmox VE onto cheap hardware and to prolong its life span.
Swap is the mechanism of offloading some memory from physical RAM to disk in order to improve RAM management efficiency. If you have a lot of physical RAM, chances are swap isn’t going to be much helpful while producing a lot of writes to the disk. On a default Proxmox VE installation, the swap size is set from 4 GB to 8 GB, depending on your RAM capacity and disk size.
You can temporarily discourage swapping by setting sysctl vm.swappiness to 0:
sysctl vm.swappiness=0
Or why not just remove the swap space altogether?
swapoff -a # disables swap
vim /etc/fstab # remove the swap entry
lvremove /dev/pve/swap # remove the swap logical volume
In most cases, you won’t need swap on a Proxmox VE host. If you find yourself needing swap, you should probably consider upgrading your RAM instead.
Every system produces logs, but Proxmox VE is particularly prolific on this. In a production environment, you’ll want to keep the logs by storing them on a separate disk (but why is it running on an eMMC in the first place?). So get another reliable disk and migrate the logs:
# assuming the new disk is /dev/sdX
systemctl stop rsyslog
mkdir -p /var/log1
mount /dev/sdX1 /var/log1
rsync -avAXx /var/log/ /var/log1/
rm -rf /var/log
mkdir /var/log
umount /var/log1
vim /etc/fstab # add an entry for /dev/sdX1
systemctl daemon-reload # see notes
mount /var/log
systemctl start rsyslog
Notes on the above commands:
- Prefer rsync over cp if you need to perform a non-trivial copy operation. (The original blog uses cp.)
- Going through fstab guarantees any mounts are consistent and persistent across reboots.
- Why systemctl daemon-reload after editing fstab? Because systemd is sometimes too smart (I got bitten by this once).
The original blog suggests replacing a few file with symlinks to /dev/null
, which I find rather incomplete and ineffective. On my 5-GB-used rootfs, /var/log
takes 1.8 GB, of which /var/log/journal
eats 1.6 GB alone, so systemd journal is the first thing to go. Editing /etc/systemd/journald.conf
and setting Storage=none
will stop its disk hogging, but better yet, you can keep a minimal amount of logs by combining Storage=volatile
and RuntimeMaxUse=16M
(ref).
If you’re on Proxmox VE 8+, you can create an “override” file for systemd-journald by adding your customizations to /etc/systemd/journald.conf.d/override.conf
. This will save some trouble when the stock configuration file gets updated and you’re asked to merge the changes.
For other logs, you can simply replace them with symlinks to /dev/null
. For example:
ln -sfn /dev/null /var/log/lastlog
I’m not keen on this method as other logs only comes at a rate of a few hundred MBs per week, so I’d rather keep them around.
The original blog suggests stopping a few non-essential services as they (which I couldn’t verify, nor do I believe so):
pve-ha-lrm
pve-ha-crm
pvefw-logger
Except for pvefw-logger
, stopping these services will not save you much disk writes as per my experiences.
rrdcached
writesrrdcached
is the service that stores and provides data for the PVE web interface to display graphs on system resource usage. I have no idea how much writes it produces, so I just relay the optimization given in the original blog.
/etc/default/rrdcached
:
WRITE_TIMEOUT=3600
so it only writes to disk once per hour.JOURNAL_PATH
so it stops writing journals (not the data itself).FLUSH_TIMEOUT=7200
(timeout for flush
command, not sure how useful it is).Edit /etc/init.d/rrdcached
for it to pick up the new FLUSH_TIMEOUT
value:
Find these lines:
${WRITE_TIMEOUT:+-w ${WRITE_TIMEOUT}} \
${WRITE_JITTER:+-z ${WRITE_JITTER}} \
And insert one line for FLUSH_TIMEOUT
:
${WRITE_TIMEOUT:+-w ${WRITE_TIMEOUT}} \
${FLUSH_TIMEOUT:+-f ${FLUSH_TIMEOUT}} \
${WRITE_JITTER:+-z ${WRITE_JITTER}} \
After editing both files, restart the service: systemctl restart rrdcached.service
pvestatd
pvestatd
provides an interface for hardware information for the PVE system. It shouldn’t produce much writes and stopping it will prevent creation of new VMs and containers, so I don’t recommend stopping it. The original blog probably included this option as a result of a mistake or ignorance.
We can see how Proxmox VE is designed to provide enterprise-grade reliability and durability, at the expense of producing lots of disk writes for its various components like system logging and statistics. Based on the above analysis, it seems perfectly reasonable that Proxmox VE decides not to support eMMC storage.
This blog combines a few tips from the original blog and my own experiences. I hope it helps you with your Proxmox VE setup on any eMMC-backed devices.
But really?
There’s one key question left unanswered by everything above: How much writes does Proxmox VE really produce?
To answer this question, let’s see some of my examples:
Specs:
Total writes as of July 2023 (rootfs-only, thanks to this answer):
# lrwxrwxrwx 1 root root 7 Jul 12 15:48 /dev/pve/root -> ../dm-4
# cat /sys/fs/ext4/dm-4/lifetime_write_kbytes
17017268104
Result: 4.5 TB annually.
Specs:
Total writes as of July 2023 (rootfs-only):
# lrwxrwxrwx 1 root root 7 Jan 21 2022 /dev/pve/root -> ../dm-1
# cat /sys/fs/ext4/dm-1/lifetime_write_kbytes
2336580629
Result: 1.5 TB annually.
Specs:
Total writes as of July 2023:
# smartctl -A /dev/sda
241 Total_LBAs_Written 2849895751
humanize.naturalsize(2849895751 * 512, format="%.2f")
: 1.46 TB (≈ 2 TB annually)
This one really depends on the hardware you get. In 2023 virtually every reasonable TLC flash chip should withstand at least 1,000 P/E cycles, so even a pathetic 8 GB eMMC should last around 10 TB of writes, as that on a Raspberry Pi Compute Module 4.
If you get anything larger than that, you should be fine expecting it to survive at least 20 TB of writes.
Congratulations on reading this far.
If you managed to hold your paranoia and refrain from putting anything into action, you can now sit back and relax. Unless you’re squeezing hundreds of VMs and containers into a single eMMC-driven board (poor board) without separate storage for VMs, your eMMC is not going to die anytime soon.
As the policies evolved, our school’s reporting platform also underwent changes. I had to update the reporting script multiple times with new features to align those of the reporting platform.
Much like my previous article, there’s a significant distinction between making something work and making it work with elegance. So in this article, I’ll share my infrastructure for the automated daily report system, and delve into some design options and decisions I made in the way.
Writing a script is about the easiest thing in the whole system with the least technical complexity. Anyone with basic scripting abilities can do it well, so I open-sourced mine. It only takes a few minutes to open the Developer Tools on your browser, identify the request originating from the [Submit] button, copy its payload out and put that into a script, and it’s ready to service. If anything marginally fancy were to be added, it’d be saving certain data to a separate file so that others can adopt the script more easily.
The next thing is to run the script every day at a desired time. A common solution is to use Cron that is simple and easy. Systemd timers is a modern alternative offering more features at the expense of a more complex configuration. I chose the latter for its RandomizedDelaySec
option, so that the script won’t be run at the exact same time every day.
At the beginning I also had a sample GitHub Actions workflow file so that others can fork my repository and start automating their reports with minimal effort. However, I scrapped it later on realizing it’s against GitHub’s ToS.
The next thing is to stay informed of whether the script is working properly. Logging in to the server and reading logs every day is not fun. Assuming that it worked and ending up being denied entry to the school is even worse. So it’d be nice to be notified of everything it does.
A common choice is via email, but it’s lacking a bit of timeliness. I chose Telegram because I’m actively using it and it provides a bot API. Adding python-telegram-bot
to the script and a few lines of code, I can get a notification on my Telegram every time the script runs.
My actual setup differs slightly, with an extra component between the script and the bot: an AWS Lambda serverless function. I did this for two reasons:
api.telegram.org
) is not directly accessible from mainland China for well-known reasons.requests.post
.As a bonus feature, I also send the error message and the line number in case of an exception, so that I can quickly identify the problem before investigating the logs.
[THU Checkin] Success: 2023-02-24 20:42:23
Checkin: Success
Apply: Success
[THU Checkin] ❌ Error: 2023-02-25 20:05:46
AttributeError: ‘NoneType’ object has no attribute ‘group’
On checkin.py line 67
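For reference, the script-side call that produces messages like the above can be as simple as this sketch (the relay URL is hypothetical; the Lambda behind it just calls Telegram's sendMessage API with a stored token and chat ID):
import requests

NOTIFY_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/notify"  # hypothetical relay

def notify(text: str) -> None:
    try:
        requests.post(NOTIFY_URL, json={"text": text}, timeout=10)
    except requests.RequestException:
        pass  # a failed notification should not crash the report itself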
Sometime later, our school began to demand regular uploads of our health QR code. The QR code is generated by a governmental mobile app whose retrieval is, unfortunately, difficult to automate. Before stepping over the line of producing fake QR codes, I decided to take the screenshots manually and have my script upload them to the reporting platform. The good news is, there are no measures on the platform to validate the uploaded images, so uploading an outdated screenshot has no consequences most of the time, and I don't have to constantly update the screenshots for the script.
Image uploading is nothing new to the requests
Python library, but I have to deliver the files from my phone somehow. Options to transfer files from an Android phone to a Linux server are abundant, and for me I found SMB the most convenient. Root Explorer is the file manager that I’ve been using for a decade, so I could just set up Samba on my server to receive the files from it.
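The upload itself is just a multipart POST with requests; here's a sketch with a made-up endpoint and field name (the real platform expects whatever its own form uses, plus an authenticated session):
import requests

session = requests.Session()  # assumed to be logged in already
with open("/srv/checkin/qrcode-1.png", "rb") as f:
    r = session.post(
        "https://report.example.edu/api/upload",
        files={"image": ("qrcode-1.png", f, "image/png")},
        timeout=30,
    )
r.raise_for_status()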
[THU Checkin] Success: 2023-02-25 08:33:36
Checkin: Success
Apply: Success
Image 1: Skipped
Image 2: Success
Image 3: Success
Alternatively, I could have my Telegram bot accept the images and forward them to the server. This would be more convenient in terms of using, but much less in coding as I didn’t have any existing code in my Telegram bot that handles images. Meanwhile, I already had Samba running on my server so I in fact did not set it up anew.
At this point everything is operational, with one detail missing: the SMB protocol is not known for being secure. Exposing the SMB port to the Internet is prone to trouble, and connecting to a VPN every time is not convenient. Luckily, I have Clash for Android running on my phone 24/7 that I can use to proxy Root Explorer. I set up a shadowsocks-libev server and configured Clash to route traffic targeting my server through it, and then closed the SMB port in my server firewall.
There’s a noteworthy thing about Clash: It’s a rule-based proxy software that reads configurations. My airport1 service provides their configuration through a subscription URL, but Clash for Android doesn’t support editing subscribed config. Another background story comes up here: I have another Lambda function serving as my own Clash config subscription. It fetches the airport config and modifies it to my preferences, and then serves it to Clash. It also makes updating the config easier, as I can just update the Lambda function code and the changes will be reflected in Clash.
Fun fact: My custom subscription is also used with Clash for Windows on my computer, which helped me completely bypass two RCE vulnerabilities (1, 2).
After all this complexity, here’s what I’ve got:
The script runs every day at a random time in a configured time span, and I get a notification on Telegram regardless of whether it succeeds or fails. If the script fails I also have the required information to look into it. The script also uploads the health QR code screenshots to the reporting platform, and I can update the images from my phone through a secured connection.
Of all these tasks, only taking the screenshots and uploading them to the server is manual, denoted in the image by blue arrows. All black arrows are automated and require no attention to function.
As the zero-COVID policy came crumbling down in December 2022, our school also put an end to the daily health reporting system. As a result, I can safely share my setup here without fearing repercussions. I hope this article brings you some inspiration for your next automation project.
During the days around the strictest lockdown of campuses, all students' requests to leave campus were manually reviewed by two levels of authority, the second level being the dean. Our department consists of over 2,000 students who kept submitting requests every day. Needless to say, many staff weren't happy about this, the dean in particular. We were once asked to stop phoning her as she was already processing requests from 7 AM to 11 PM every day. To everyone's relief, the reviewing process was cancelled within a few days and requests were automatically approved thereafter.
Shadowsocks service providers are commonly called “airports” because the icon of Shadowsocks is a paper plane, and every provider has multiple “plane servers” that you can use. ↩