An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.
It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.