FRESH

Hacker News

Confusables.txt and NFKC disagree on 31 characters

59 points by pimterry

by akersten

2 subcomments

Unicode is both the best thing that's ever happened to text encoding and the worst. The approach I take here is to treat any text coming from the user as toxic waste. Assume it will say "Administrator" or "Official Government Employee" or be 800 pixels tall because it was built only out of decorative combining characters. Then put it in a fixed box with overflow hidden, and use some other UI element to convey things like "this is an official account."
The worst part that this article doesn't even touch on with normalizing and remapping characters is the risk your login form doesn't do it but your database does. Suddenly I can re-register an existing account by using a different set of codepoints that the login system doesn't think exists but the auth system maps to somebody else's record.

by joshdata

3 subcomments

> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)
That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.
In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for making text search indexes where you want matches to be forgiving, but it would be both obvious and wrong to use it in most of the web because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.

by Liftyee

1 subcomments

Does the "removing dead code" advantage outweigh the additional complexity of having to maintain 2 different confusables lists: one for when NFKC has been applied first and one without? It didn't sound like applying one after the other caused any errors, just that some previously reachable states are unreachable.

by happytoexplain

1 subcomments

by kccqzy

1 subcomments

If you allow users to submit arbitrary Unicode string as text, why would you need to check confusables.txt? Whose confusion are you guarding against?

by rurban

0 subcomment

That's a user pipeline problem. If you just check confusables without any tr39 algo you will be disappointed also. I had to patch confusables for the C/C++ proposal for about 10 characters also.
"...the default confusables list is extremely buggy. It needs at least 7 manual exceptions for the ASCII range, 12 exceptions for Greek, and I didn’t check any others scripts. python and clang-tidy were very unsuccessful with this approach, compared to java, rust and cperl with the mixed-script approach." https://rurban.github.io/libu8ident/#confusables
In detail: https://rurban.github.io/libu8ident/doc/D2528R1.html at 10 TR39 Mixed Scripts

by brazzy

3 subcomments

> The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it
That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-latin scripts. If you want everyone to force Latin script for identifiers, then own up to it and say so. But rejecting just some them for being too similar to latin characters just makes the behaviour inconsistent and confusing for users.

by csense

2 subcomments

My theory: The "long S" in "Congreſs" is an f. They used f instead of s because without modern dental care, a lot of people in the 1600's and 1700's were miffing teeth and fpoke with a lifp.