""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").
The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """
One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols? To make quantified machine readable (oh, this is not a 100K license plate or money amount, but a temperature)? Or to make it easier for specialized software to display it in correct placed/units?
I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.
Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.
I was worried (I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like) by the article's use of U+212A (Kelvin symbol) as sample text, so I had to look it up [1].
Anyway, according to Wikipedia the dedicated symbol should not be used:
However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.
That was comforting, to me. :)
> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:
Python's ICU support is based on ICU4C. Rust's ICU "bindings" are actually a new implementation called ICU4X, by developers who worked on i18n at Mozilla and Google and on ICU4C, with the goal of a cleaner, more performant implementation that is also memory safe. Maybe not relevant (as in substantially altering the benchmarks), but it's at least worth noting that the ICU backends aren't consistent throughout.
Also very cool and approachable guy.
(Best wishes if you're reading this.)
Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.