I was using timers, and I was getting insanely different times for the same code, going anywhere from 0ms to 20ms without any obvious changes to the environment or anything.
I was banging my head against it for hours, until I realized that async code is weird. Async code isn’t directly “run”, it’s “scheduled” and the calling thread can yield until we get the result. By trying to do microbenchmarks, I wasn’t really testing “my code”, I was testing the .NET scheduler.
It was my first glimpse of why benchmarking is deceptively hard. I think about it every time I have to write performance tests.
There was a dedicated team of folks who had built a random forest model to predict what the latency SHOULD be, based on features like:
- trading team
- exchange
- order volume
- time of day
- etc
If that system detected a change, an unexpected spike, etc., it would fire off an alert, and then it was my job, as part of the trade support desk, to go investigate why: was it a different trading pattern, did the exchange modify something, and so on.
One day, we get an alert for IEX (of Flash Boys fame). I end up on the phone with one of our network engineers and, from IEX, one of their engineers and their sales rep for our company.
We are describing the change in latency and the sales rep drops his voice and says:
"Bro, I've worked at other firms and totally get why you care about latency. Other exchanges also track their internal latencies for just this type of scenario so we can compare and figure out the the issue with the client firm. That being said, given who we are and our 'founding story', we actually don't track out latencies so I have to just go with your numbers."
* mean/median/p99/p99.9/p99.99/max over day, minute, second, and 10ms windows
* software timestamps from the rdtsc counter for interval measurements - am17 says why below (a sketch follows this list)
* all of that not just on a timer - but also for each event - order triggered for send, cancel sent, etc - for ease of correlation to markouts.
* hw timestamps off some sort of port replicator that has under 3ns jitter - and a way to correlate to above.
* network card timestamps for similar - Solarflare cards (now AMD) support start-of-frame to start-of-Ethernet-frame measurements.
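As promised above, here is a minimal sketch of the rdtsc item on that list: read the TSC around an event and reduce the samples to min/median/p99/p99.9/max. It assumes an x86 target with GCC or Clang, the measured work is elided, and it glosses over TSC-frequency calibration and serialization, so it is an illustration rather than a production harness.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>  // __rdtsc on GCC/Clang, x86 only

// Read the time-stamp counter. Units are TSC ticks, not nanoseconds;
// converting requires calibrating the TSC frequency, which is skipped here.
static inline uint64_t tsc_now() { return __rdtsc(); }

// q-th quantile (q in [0,1]) over a copy of the samples.
uint64_t percentile(std::vector<uint64_t> v, double q) {
    std::sort(v.begin(), v.end());
    return v[static_cast<size_t>(q * (v.size() - 1))];
}

int main() {
    std::vector<uint64_t> samples;
    samples.reserve(100000);

    for (int i = 0; i < 100000; ++i) {
        uint64_t t0 = tsc_now();
        // ... the event being measured would go here, e.g. "order triggered for send" ...
        uint64_t t1 = tsc_now();
        samples.push_back(t1 - t0);
    }

    std::printf("min=%llu med=%llu p99=%llu p99.9=%llu max=%llu (ticks)\n",
                (unsigned long long)*std::min_element(samples.begin(), samples.end()),
                (unsigned long long)percentile(samples, 0.50),
                (unsigned long long)percentile(samples, 0.99),
                (unsigned long long)percentile(samples, 0.999),
                (unsigned long long)*std::max_element(samples.begin(), samples.end()));
}
```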
The number one mistake I see people make is measuring one time and taking the results at face value. If you do nothing else, measure three times and you will at least have a feeling for the variability of your data. If you want to compare two versions of your code with confidence there is usually no way around proper statistical analysis.
Which brings me to the second mistake. When measuring runtime, taking the mean is not a good idea. Runtime measurements usually skew heavily towards a theoretical minimum, which is a hard lower bound. The distribution is heavily lopsided, with a long tail. If your objective is to compare two versions of some code, the minimum is a much better measure than the mean.
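To make that concrete, here is a small sketch of such a comparison: run each version many times and compare minima instead of means. The version_a/version_b functions are hypothetical stand-ins, and a real harness would also have to keep the compiler from optimizing the measured work away.

```cpp
#include <chrono>
#include <cstdio>

// Run f() `reps` times and return the minimum wall-clock time in nanoseconds.
// The minimum approximates the hard lower bound the measurements skew towards.
template <typename F>
long long min_runtime_ns(F f, int reps = 30) {
    long long best = -1;
    for (int i = 0; i < reps; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (best < 0 || ns < best) best = ns;
    }
    return best;
}

// Hypothetical stand-ins for the two versions being compared.
void version_a() { /* old implementation */ }
void version_b() { /* new implementation */ }

int main() {
    std::printf("version_a min: %lld ns\n", min_runtime_ns(version_a));
    std::printf("version_b min: %lld ns\n", min_runtime_ns(version_b));
}
```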
The other thing is that L1/L2 switches provide this functionality of taking switch timestamps and marking frames with them, which is the true test of e2e latency without any clock drift etc.
Also, fast code is actually really, really hard; you just need to create the right test harness once.
And measuring is hard. This is why consistently fast code is hard.
In any case, adding some crude performance testing to your CI/CD suite, and flagging a problem if a test ran for much longer than it used to, is very helpful for quickly detecting bad performance regressions.
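For illustration, here is a sketch of how crude such a check can be and still catch regressions: time a workload and fail the build if it exceeds a stored baseline by more than some slack. The workload, baseline value, and 30% tolerance are all placeholders; a real setup would read the baseline from a previous run's artifact.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Hypothetical workload standing in for the code path under test.
void workload() {
    volatile long sum = 0;
    for (long i = 0; i < 10000000; ++i) sum = sum + i;
}

int main() {
    // Baseline from a previous known-good build, in ms. In a real setup this
    // would come from a file or CI artifact, not a hard-coded placeholder.
    const double baseline_ms = 25.0;
    const double tolerance   = 1.3;  // allow 30% slack for run-to-run noise

    auto t0 = std::chrono::steady_clock::now();
    workload();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    std::printf("workload took %.2f ms (baseline %.2f ms)\n", ms, baseline_ms);

    // A non-zero exit code fails the CI job and flags a possible regression.
    if (ms > tolerance * baseline_ms) {
        std::fprintf(stderr, "possible performance regression\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```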
The article states the opposite.
> Writing fast algorithmic trading system code is hard. Measuring it properly is even harder.