FRESH

Hacker News

Full Unicode Search at 50× ICU Speed with AVX‑512

215 points by ashvardanian

by bbminner

3 subcomments

I was really confused about the case folding, this page explained the motivation well https://jean.abou-samra.fr/blog/unicode-misconceptions
""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").
The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """
One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols? To make quantified machine readable (oh, this is not a 100K license plate or money amount, but a temperature)? Or to make it easier for specialized software to display it in correct placed/units?

by ashvardanian

5 subcomments

This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:
I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.
Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

by unwind

1 subcomments

by anonnon

1 subcomments

> ICU has bindings for Rust that provide case-folding functionality, but not case-insensitive substring search.
> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:
Python's ICU support is based on ICU4C. Rust's ICU "bindings" are actually a new implementation called ICU4X, by developers who worked on i18n at Mozilla and Google and on ICU4C, with the goal of a cleaner, more performant implementation that is also memory safe. Maybe not relevant (as in substantially altering the benchmarks), but it's at least worth noting that the ICU backends aren't consistent throughout.

by orthoxerox

1 subcomments

Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?

by moralestapia

0 subcomment

Ash is amazing!
Also very cool and approachable guy.
(Best wishes if you're reading this.)

by mgaunard

2 subcomments

In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.
Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.

by hans_castorp

1 subcomments

by bawolff

0 subcomment

While this is definitely really cool, wouldn't people who need this type of speed usually just store the text to be searched already case folded?

by kardianos

1 subcomments

Looks neat. What are all the genomic sequence comparisons in there for? Is this a grab bag of interesting string methods or is there a motivation for this?

by user3939382

0 subcomment

by xking5205

0 subcomment

by andersa

5 subcomments

From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß, they are different characters. Google likely changed this because it didn't do what users wanted.