This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage of it with minimal effort and zero infrastructure changes.
You can also use XDP for outgoing packets on tap interfaces.
* The BPF verifier's DX is not great yet. If it finds problems with your BPF code, it will spit out a rather inscrutable set of error messages that often requires a good understanding of the verifier internals (e.g. the register nomenclature) to debug.
* For the same source code, the bytecode generated by the compiler can change across compiler versions in a way that breaks verification, e.g. because the new compiler version implemented an optimization that the verifier rejects (see https://github.com/iovisor/bcc/issues/4612).
* Checksum updating requires extra care. I believe you can only do incremental updates, not just for the better performance the post mentions, but also because the verifier does not allow BPF programs to operate on unbounded buffers, so checksumming a whole packet of unknown size is tricky and cumbersome. This mostly works, but you have to be careful with packets that were generated with csum offload: they don't have a valid checksum yet, so it can't be incrementally updated.
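For reference, the incremental update is just the one's-complement arithmetic from RFC 1624 (eqn. 3). A minimal sketch in plain C (the `csum_update` helper name is mine, not a kernel or BPF API; in real BPF code you'd reach for helpers like `bpf_l3_csum_replace`):

```c
#include <stdint.h>

/* Incremental Internet checksum update per RFC 1624, eqn. 3:
 *   HC' = ~(~HC + ~m + m')
 * where HC is the old checksum, m the old 16-bit field value, and
 * m' the new value. Additions are one's-complement, i.e. carries
 * are folded back in (the loop below). */
uint16_t csum_update(uint16_t old_check, uint16_t old_field,
                     uint16_t new_field)
{
    uint32_t sum = (uint16_t)~old_check;
    sum += (uint16_t)~old_field;
    sum += new_field;
    /* Fold carries back into the low 16 bits; at most two folds
     * are ever needed, so this loop is trivially bounded. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This also shows why incremental updates sidestep the verifier issue: the fold loop above is bounded, whereas checksumming a whole packet requires looping over a buffer whose length the verifier cannot bound.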
As the blog post points out, the kernel networking stack does a lot of work that we don't generally think about. Once you start taking things into your own hands you don't have the luxury of ignorance anymore (think not just ARP but also MTU, routing, RP filtering etc.), something any user of userspace networking frameworks like DPDK will tell you.
My general recommendation is to stick with the kernel unless you have a very good justification for chasing better performance. If you do use eBPF, save yourself some trouble and try to limit yourself to read-only operations, if your use case allows.
Also, if you are trying to debug packet drops, newer kernels record the drop reason, which you can trace with bpftrace for much better diagnostics.
Example script (might have to adjust based on kernel version):
bpftrace -e '
kprobe:kfree_skb_reason {
  $skb = (struct sk_buff *)arg0;
  $ipheader = (struct iphdr *)($skb->head + $skb->network_header);
  printf("reason: %d %s -> %s\n", arg1, ntop($ipheader->saddr), ntop($ipheader->daddr));
}'

I'll definitely be coming to check you all out at KubeCon.
We use it quite a lot for capturing and dashboarding inbound network traffic over at https://yeet.cx
I am really excited for the future of eBPF especially with tcx now being available in Debian 13. The tc API was very hard to work with.
Why doesn’t checksum offload in the NIC take care of that?
What I don’t really understand is why iptables and tc are so slow.
If the kernel can’t route packets at line speed, how are userspace applications saturating it?
In some scenarios veth is being replaced with netkit for a similar reason. Does this impact how you're going to manage this?