I run a Debian-based software router on my home network. It’s connected to multiple ISPs, so I have some policy routing rules to balance traffic between them. Some time ago, I noticed that IPv4 connectivity would intermittently get stuck where it previously hadn’t, while IPv6 kept working fine. Interestingly, the issue only happened with one specific ISP, only in the egress direction, and only for a few specific devices.

At first I suspected the ISP’s equipment, but a clue quickly dismissed that suspicion: connections to the same ISP worked fine when initiated from the router itself, as well as from many other unaffected devices. So the issue had to be within the router.

As usual, every network debugging session begins with a packet capture. I start tcpdump on both the LAN interface and the problematic WAN interface, then try curl-ing something from an affected device. The capture shows a few back-and-forth packets, after which the device keeps sending packets but the router no longer forwards them to the WAN interface. Time for a closer look.

Identifying the issue

On an affected device, curl gets stuck somewhere in the middle:

$ curl -vso /dev/null https://www.cloudflare.com/
*   Trying 104.16.124.96:443...
* Connected to www.cloudflare.com (104.16.124.96) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
^C

tcpdump shows nothing special:

# tcpdump -ni any 'host 104.16.124.96 and tcp port 443'
02:03:47.403905 lan0 In  IP 172.17.0.2.49194 > 104.16.124.96.443: Flags [S], seq 1854398703, win 65535, options [mss 1460,sackOK,TS val 1651776756 ecr 0,nop,wscale 10], length 0
02:03:47.403956 ppp0 Out IP 10.250.193.4.49194 > 104.16.124.96.443: Flags [S], seq 1854398703, win 65535, options [mss 1432,sackOK,TS val 1651776756 ecr 0,nop,wscale 10], length 0
02:03:47.447663 ppp0 In  IP 104.16.124.96.443 > 10.250.193.4.49194: Flags [S.], seq 1391350792, ack 1854398704, win 65535, options [mss 1460,sackOK,TS val 141787839 ecr 1651776756,nop,wscale 13], length 0
02:03:47.447696 lan0 Out IP 104.16.124.96.443 > 172.17.0.2.49194: Flags [S.], seq 1391350792, ack 1854398704, win 65535, options [mss 1460,sackOK,TS val 141787839 ecr 1651776756,nop,wscale 13], length 0
02:03:47.447720 lan0 In  IP 172.17.0.2.49194 > 104.16.124.96.443: Flags [.], ack 1, win 64, options [nop,nop,TS val 1651776800 ecr 141787839], length 0
02:03:47.452705 lan0 In  IP 172.17.0.2.49194 > 104.16.124.96.443: Flags [P.], seq 1:518, ack 1, win 64, options [nop,nop,TS val 1651776804 ecr 141787839], length 517
02:03:47.452751 ppp0 Out IP 10.250.193.4.49194 > 104.16.124.96.443: Flags [P.], seq 1:518, ack 1, win 64, options [nop,nop,TS val 1651776804 ecr 141787839], length 517
02:03:47.496507 ppp0 In  IP 104.16.124.96.443 > 10.250.193.4.49194: Flags [.], ack 518, win 9, options [nop,nop,TS val 141787888 ecr 1651776804], length 0
02:03:47.496527 lan0 Out IP 104.16.124.96.443 > 172.17.0.2.49194: Flags [.], ack 518, win 9, options [nop,nop,TS val 141787888 ecr 1651776804], length 0
02:03:47.498147 ppp0 In  IP 104.16.124.96.443 > 10.250.193.4.49194: Flags [P.], seq 1:2737, ack 518, win 9, options [nop,nop,TS val 141787890 ecr 1651776804], length 2736
02:03:47.498165 lan0 Out IP 104.16.124.96.443 > 172.17.0.2.49194: Flags [P.], seq 1:2737, ack 518, win 9, options [nop,nop,TS val 141787890 ecr 1651776804], length 2736
02:03:47.498175 lan0 In  IP 172.17.0.2.49194 > 104.16.124.96.443: Flags [.], ack 2737, win 70, options [nop,nop,TS val 1651776850 ecr 141787890], length 0
02:03:47.498195 ppp0 In  IP 104.16.124.96.443 > 10.250.193.4.49194: Flags [P.], seq 2737:3758, ack 518, win 9, options [nop,nop,TS val 141787890 ecr 1651776804], length 1021
02:03:47.498228 ppp0 Out IP 10.250.193.4.49194 > 104.16.124.96.443: Flags [R], seq 1854399221, win 0, length 0
^C
711 packets captured
720 packets received by filter
0 packets dropped by kernel

Considering the complexity of the policy routing, I tried inspecting conntrack status in parallel. Nothing unusual there either, until I tried matching conntrack events with tcpdump:

# conntrack -E -s 172.17.0.2 -p tcp --dport 443 2>/dev/null | ts %.T
02:03:47.404103     [NEW] tcp      6 120 SYN_SENT src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 [UNREPLIED] src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194
02:03:47.447748  [UPDATE] tcp      6 60 SYN_RECV src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 mark=48
02:03:47.447843 [DESTROY] tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
02:03:47.452798     [NEW] tcp      6 300 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 [UNREPLIED] src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194
02:03:47.496572  [UPDATE] tcp      6 300 src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 mark=48
02:03:47.498195  [UPDATE] tcp      6 300 src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
02:03:47.498243 [DESTROY] tcp      6 432000 ESTABLISHED src=172.17.0.2 dst=104.16.124.96 sport=49194 dport=443 src=104.16.124.96 dst=10.250.193.4 sport=443 dport=49194 [ASSURED] mark=48
^C

With ts (from moreutils) adding timestamps to the conntrack events, I can see that the conntrack entry is destroyed right after (+123μs) the second packet from the device (the bare ACK completing the handshake) comes in. The following packet causes (+93μs) the same conntrack entry to be recreated, and curl apparently gets far enough into the TLS handshake that it sends only one more packet and then waits for the server, so nothing arrives to recreate the connection a third time.

Clearly the second packet should be considered ESTABLISHED by conntrack, and it makes no sense for it to trigger a DESTROY event. I’m at a loss here and start trying random things hoping to find a clue. I tried downgrading the kernel to 5.10 (from Bullseye) and upgrading to 6.9 (from Bookworm backports), but nothing changed, which rules out a kernel regression.

After scrutinizing my firewall rules, I noticed a small difference between IPv4 and IPv6 rules:

# rules.v4
*nat
# ...
-A POSTROUTING -o ppp+ -j MASQUERADE
COMMIT

*mangle
:PREROUTING ACCEPT [0:0]
# ...
-A PREROUTING -j CONNMARK --restore-mark
-A PREROUTING -m mark ! --mark 0 -j ACCEPT
#A PREROUTING -m conntrack --ctstate NEW,RELATED -j MARK --set-xmark 0x100/0x100
-A PREROUTING -m mark --mark 0/0xff -j ExtraConn
-A PREROUTING -m mark --mark 0/0xff -j IntraConn
-A PREROUTING -m mark --mark 0/0xff -j MARK --set-xmark 0x30/0xff
-A PREROUTING -j CONNMARK --save-mark
# ...
-A ExtraConn -i ppp0 -j MARK --set-xmark 0x30/0xff
# ...
-A IntraConn -s 172.17.0.2/32 -j iBugOptimized
# ...
-A iBugOptimized -j MARK --set-xmark 0x36/0xff
-A iBugOptimized -j ACCEPT
COMMIT

However, rules.v6 is missing the last rule in iBugOptimized (the extra ACCEPT), and IPv6 is exempt from the conntrack issue. Removing that ACCEPT rule from rules.v4 fully restores connectivity. So this is certainly the trigger, but how is it related to the actual issue?
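
For completeness, the same fix can also be applied to the live ruleset without a full reload. This is only a one-liner sketch, assuming the chain lives in the mangle table as in rules.v4 above; the persistent fix is still editing rules.v4 itself:

# delete the stray ACCEPT at the end of the iBugOptimized chain
iptables -t mangle -D iBugOptimized -j ACCEPT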

Investigation

I know there are some decent tools on GitHub that aid in debugging iptables, which is notorious for its complexity. But since I wrote the entire firewall rule set and still maintain it by hand, I’m taking the hard route of watching and understanding every single rule.

The difference that single ACCEPT rule makes is that the packet leaves the PREROUTING chain early and skips the --save-mark step, so the assigned firewall mark is never saved to the corresponding conntrack entry. When a reply packet comes in, conntrack has nothing to offer for the --restore-mark step, so the packet gets assigned the “default” mark of 0x30, and that value is what ends up being saved. I should have noticed the wrong conntrack mark earlier: conntrack -L clearly showed a mark of 48 instead of the intended 54 (0x36 from iBugOptimized). This narrows the cause down to a discrepancy between the packet mark and the conntrack mark.
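
The mismatch is easy to spot once you know to look for it; listing the entries with the same filter I used for conntrack -E shows the saved mark directly:

# list the affected entries and inspect the mark= field
conntrack -L -s 172.17.0.2 -p tcp --dport 443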

Firewall marks are a more flexible way to implement slightly complicated policy-based routing, as they defer the routing decision to the mangle/PREROUTING chain instead of the single global chain of routing rules. In my case, every ISP gets assigned a fwmark routing rule like this:

9:      from all fwmark 0x30/0xff lookup eth0 proto static
9:      from all fwmark 0x31/0xff lookup eth1 proto static
9:      from all fwmark 0x36/0xff lookup ppp0 proto static
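
For context, rules like these can be set up along the following lines. This is only a sketch: it assumes the table names eth0, eth1 and ppp0 are declared in /etc/iproute2/rt_tables and that each table holds a default route through the corresponding ISP.

ip rule add fwmark 0x30/0xff table eth0 priority 9
ip rule add fwmark 0x31/0xff table eth1 priority 9
ip rule add fwmark 0x36/0xff table ppp0 priority 9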

Presumably, subsequent packets from the same connection should be routed to eth0, because they carry the mark 0x30 restored from the conntrack entry. This is not the case, however, as tcpdump shows nothing on eth0 and everything on ppp0.
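
To double-check which table a given mark selects, ip route get accepts a mark argument. A quick sketch (for forwarded traffic the source address and incoming interface also factor into rule matching; they can be supplied with from and iif):

# route lookup for a packet marked 0x30 vs. 0x36
ip route get 104.16.124.96 mark 0x30
ip route get 104.16.124.96 mark 0x36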

Unless there’s some magic in the kernel that destroys a connection simply because the packet mark doesn’t match the conntrack mark, this still isn’t the root cause. Verifying whether such magic exists is relatively easy:

iptables -t mangle -N Test
iptables -t mangle -I PREROUTING -s 172.17.0.2/32 -j Test
iptables -t mangle -A Test -m conntrack --ctstate NEW -j MARK --set-xmark 0x36/0xff
iptables -t mangle -A Test -m conntrack --ctstate ESTABLISHED -j MARK --set-xmark 0x30/0xff

This time, even though conntrack shows no mark (i.e. zero) on the connection, the packets are still routed to ppp0 as intended, and curl gets stuck at the same place as before. So the kernel doesn’t care about the conntrack mark at all.

Unfortunately, this is about as far as userspace inspection can go. I need to find out why exactly the kernel decides to destroy the conntrack entry.

bpftrace comes in

I’ve seen professional kernel network developers extensively running bpftrace to debug network issues (THANK YOU to the guy behind the Telegram channel Welcome to the Black Parade), so I’m giving it a try.
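
As an aside, bpftrace can list the probes it knows about, which is handy for figuring out what can be attached to. A quick sketch (the available probes depend on the kernel build):

# any tracepoints related to conntrack?
bpftrace -l 'tracepoint:*' | grep -i conntrack
# kprobes on the conntrack core functions
bpftrace -l 'kprobe:nf_ct_*'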

The first thing is to figure out what to hook. Searching Google did not reveal a tracepoint for conntrack events, but I did learn about the conntrack code path. With help from ChatGPT, I begin with kprobe:nf_ct_delete, putting together the struct definitions starting from struct nf_conn:

#include <linux/socket.h>
#include <net/netfilter/nf_conntrack.h>

kprobe:nf_ct_delete
{
    // The first argument is the struct nf_conn
    $ct = (struct nf_conn *)arg0;

    // Check if the connection is for IPv4
    if ($ct->tuplehash[0].tuple.src.l3num == AF_INET) {
        $src_ip = $ct->tuplehash[0].tuple.src.u3.ip;
        $dst_ip = $ct->tuplehash[0].tuple.dst.u3.ip;
        printf("Conntrack destroyed (IPv4): src=%s dst=%s\n",
                ntop($src_ip), ntop($dst_ip));
    }
}

Seems all good, except it won’t compile:

ERROR: Can not access field 'u3' on expression of type 'none'
        $dst_ip = $ct->tuplehash[0].tuple.dst.u3.ip;
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After another half-hour of struggling and pestering ChatGPT, I gave up trying to access the destination tuple and decided I’d be fine with inspecting the stack trace instead:

#include <linux/socket.h>
#include <net/netfilter/nf_conntrack.h>

kprobe:nf_ct_delete
{
    // The first argument is the struct nf_conn
    $ct = (struct nf_conn *)arg0;

    // Check if the connection is for IPv4
    if ($ct->tuplehash[0].tuple.src.l3num == AF_INET) {
        $tuple_orig = $ct->tuplehash[0].tuple;
        $src_ip = $tuple_orig.src.u3.ip;
        $src_port_n = $tuple_orig.src.u.all;
        // Byte-swap the 16-bit port from network to host byte order (ntohs)
        $src_port = ($src_port_n >> 8) | (($src_port_n << 8) & 0x00FF00);
        // 172.17.0.2 in network byte order, read as a little-endian u32
        if ($src_ip != 0x020011ac) {
            return;
        }
        $mark = $ct->mark;
        printf("Conntrack destroyed (IPv4): src=%s sport=%d mark=%d\n",
                ntop($src_ip), $src_port, $mark);

        printf("%s\n", kstack());
    }
}

One noteworthy detail: I have to filter the connections inside the program, otherwise my screen gets flooded with unrelated events.

The output looks promising:

Attaching 1 probe...
Conntrack destroyed (IPv4): src=172.17.0.2 sport=39456 mark=0 proto=6

    nf_ct_delete+1
    nf_nat_inet_fn+188
    nf_nat_ipv4_out+80
    nf_hook_slow+70
    ip_output+220
    ip_forward_finish+132
    ip_forward+1296
    ip_rcv+404
    __netif_receive_skb_one_core+145
    __netif_receive_skb+21
    netif_receive_skb+300
    ...

Reading the source code of the top few functions in the call stack:

// net/netfilter/nf_nat_proto.c

static unsigned int
nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
                const struct nf_hook_state *state)
{
#ifdef CONFIG_XFRM
    const struct nf_conn *ct;
    enum ip_conntrack_info ctinfo;
    int err;
#endif
    unsigned int ret;

    ret = nf_nat_ipv4_fn(priv, skb, state); // <-- call to nf_nat_ipv4_fn
#ifdef CONFIG_XFRM
    if (ret != NF_ACCEPT)
        return ret;
    // ...

// net/netfilter/nf_nat_proto.c

static unsigned int
nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
               const struct nf_hook_state *state)
{
    // ...

    return nf_nat_inet_fn(priv, skb, state);
}

// net/netfilter/nf_nat_core.c

unsigned int
nf_nat_inet_fn(void *priv, struct sk_buff *skb,
               const struct nf_hook_state *state)
{
    // ...
        if (nf_nat_oif_changed(state->hook, ctinfo, nat, state->out))
            goto oif_changed;
    // ...

oif_changed:
    nf_ct_kill_acct(ct, ctinfo, skb);
    return NF_DROP;
}

Accounting for function inlining, there’s only one path from nf_nat_inet_fn into nf_ct_delete, and that is through nf_ct_kill_acct. The only thing that triggers it is nf_nat_oif_changed, which returns true when an original-direction packet of a masqueraded connection is about to leave through a different interface than the one the connection was masqueraded on.

Conclusion

Now everything makes sense. Because of a badly placed ACCEPT rule, the conntrack entry gets saved with a different mark than intended, and the connection is then destroyed because subsequent packets, carrying the wrong restored mark, are routed out a different interface. The microsecond-scale gaps between the related events also roughly match the length of this code path. It also has to be a NAT’ed connection, as this path into nf_ct_delete is only reachable when the packet is about to leave through the egress interface.
