This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage of it with minimal effort and zero infrastructure changes.
You can also use XDP for outgoing packets on tap interfaces.
* The BPF verifier's DX is not great yet. If it finds problems with your BPF code, it will spit out a rather inscrutable set of error messages that often requires a good understanding of the verifier internals (e.g. the register nomenclature) to debug.
* For the same source code, the bytecode generated by the compiler can change across compiler versions in a way that breaks verification, e.g. because the new compiler version implemented an optimization that the verifier rejects (see https://github.com/iovisor/bcc/issues/4612).
* Checksum updating requires extra care. I believe you can only do incremental updates, not just for the better performance the post mentions, but also because the verifier does not allow BPF programs to operate on unbounded buffers, so checksumming a whole packet of unknown size is tricky and cumbersome. This mostly works, but you have to be careful with packets that were generated with csum offload: they don't have a valid checksum yet, so it can't be incrementally updated.
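For reference, the incremental update is just the one's-complement arithmetic from RFC 1624 (eqn. 3). A minimal sketch in plain C (the `csum_update` helper name is mine, not a kernel or BPF API; in real BPF code you'd reach for helpers like `bpf_l3_csum_replace`):

```c
#include <stdint.h>

/* Incremental Internet checksum update per RFC 1624, eqn. 3:
 *   HC' = ~(~HC + ~m + m')
 * where HC is the old checksum, m the old 16-bit field value, and
 * m' the new value. Additions are one's-complement, i.e. carries
 * are folded back in (the loop below). */
uint16_t csum_update(uint16_t old_check, uint16_t old_field,
                     uint16_t new_field)
{
    uint32_t sum = (uint16_t)~old_check;
    sum += (uint16_t)~old_field;
    sum += new_field;
    /* Fold carries back into the low 16 bits; at most two folds
     * are ever needed, so this loop is trivially bounded. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This also shows why incremental updates sidestep the verifier issue: the fold loop above is bounded, whereas checksumming a whole packet requires looping over a buffer whose length the verifier cannot bound.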
As the blog post points out, the kernel networking stack does a lot of work that we don't generally think about. Once you start taking things into your own hands you don't have the luxury of ignorance anymore (think not just ARP but also MTU, routing, RP filtering etc.), something any user of userspace networking frameworks like DPDK will tell you.
My general recommendation is to stick with the kernel unless you have a very good justification for chasing better performance. If you do use eBPF, save yourself some trouble and try to limit yourself to read-only operations, if your use case allows.
Also, if you are trying to debug packet drops, newer kernels record the drop reason, which you can trace with bpftrace for much better diagnostics.
Example script (might have to adjust based on kernel version):
bpftrace -e '
kprobe:kfree_skb_reason {
  $skb = (struct sk_buff *)arg0;
  $ipheader = (struct iphdr *)($skb->head + $skb->network_header);
  printf("reason: %d %s -> %s\n", arg1, ntop($ipheader->saddr), ntop($ipheader->daddr));
}'

I'll definitely be coming to check you all out at KubeCon.
We use it quite a lot for capturing and dashboarding inbound network traffic over at https://yeet.cx
I am really excited for the future of eBPF especially with tcx now being available in Debian 13. The tc API was very hard to work with.
Why doesn’t checksum offload in the NIC take care of that?
What I don’t really understand is why iptables and tc are so slow.
If the kernel can’t route packets at line speed, how are userspace applications saturating it?
In some scenarios veth is being replaced with netkit for a similar reason. Does this impact how you're going to manage this?