In a somewhat self-deprecating, but also an “anyone can do this if they have the will” fashion, I tend to say my real skill in performance tuning is stubbornness. Orphaned gains can easily be responsible for 1/2 of the runtime of your average operation, sometimes as high as 2/3. The first 20% is fairly straightforward; the rest is just hard work. There are always four smartasses willing to misquote a Knuth aphorism they never understood, but the fact is, if your customers hate how slow the product is, and more to the point, if your marketing people are talking about it, it’s time to get off your asses and do the hard work, because your shit is broken.
Yes, there’s specialized knowledge, but is it really any harder than understanding how databases work? Probably not. But if what is described in the quote above is a “legend”, then either things have got a lot worse while I wasn’t looking, or I’m underplaying the value of those skills in the overall mix, or both.
The biggest memory pressure problem I ever fixed came from deduping the same DB query being run twice, then intersecting two filters over that one result set. I was expecting about a 2x improvement, but instead I got 10x. In retrospect I should have hoped for maybe as high as 4x due to the extra short-circuiting on the data set, but not 10x. That was all better data locality.
The irony of that situation is that I had been pulled in as a fixer and asked if I could figure out how to get a request taking thirty seconds down to three. This was supposed to be my opening salvo in a several week drill down and a series of improvements going from 15 to 10 to 8 to 7 seconds etc and prepping to apologize when I was only able to get it down to five, maybe six seconds. And instead I was done in like two days. But it did serve my larger thesis that we were using caching wrong. Data passed on the call stack doesn’t have to be looked up a second time. Or a third, or a fifth.
Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.
I'm looking forward to the next post about how cache consistency is tough.
1. They represent a single room change with this sequence of three operations:
`VectorDiff::Set { index: 3, value: new_room }` because of the new “preview”,
`VectorDiff::Remove { index: 3 }` to remove the room… immediately followed by
`VectorDiff::PushFront { value: new_room }` to insert the room at the top of the Room List.
and I don't see any mention of atomic sequences. I think the room will momentarily disappear from view before being placed into the correct spot. That kind of thing would drive me nuts as a user. It suggests to me this is not the right abstraction. Also, if you are actually representing the result with a vector, it's O(n), so from a performance perspective, it's not great if the vector can be large: you're shifting everything from [3, n) one spot forward and then one spot back, unnecessarily. If there were a `VectorDiff::Move`, you'd only be shifting 3 elements (the distance moved). That could still be the full length of the list, but probably usually isn't. Something like a `BTreeSet` would make it actually O(lg n).
2. Taking a lock in a comparison function (they call it `Sorter`, but the name is wrong) is a smell for correctness as well as performance. Can the values change mid-sort? Then the result is non-deterministic. (In C++ it's actually undefined behavior to use a non-deterministic comparator. In Rust it's safe but still a bad idea.) You just can't sort values while they're changing, full stop, so interior mutability in a list you're sorting is suss. [edit: and for what? Within a client, are you seriously doing heavy mutations on many rooms at once? Or is a single lock on all the rooms sufficient?]
3. The sorted adapter just degrades to insertion sort of changes right here: <https://docs.rs/eyeball-im-util/0.10.0/src/eyeball_im_util/v...> and decomposes what could have been an atomic operation (append) into several inserts. Even `Set` does a linear scan and then becomes a (non-atomic again) remove and an insert, because it can change the sort order.
4. The `.sort_by(new_sorter_lexicographic(vec![Box::new(...), Box::new(...), Box::new(...)]))` means that it's doing up to three dynamic dispatches on each comparison. The `new_sorter_lexicographic` is trivial, so inline the comparisons instead. And definitely don't take a separate lock on each, yuck, although see above anyway about how you just shouldn't have locks within the vec you're sorting.
I would never use these abstractions.
This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters. Certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power-hungry as Infinity Fabric.