FRESH

Hacker News

Efficient String Compression for Modern Database Systems

152 points by jandrewrogers

by crazygringo

8 subcomments

I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.
SQLite has no compression support, MySQL/MariaDB have page-level compression which doesn't work great and I've never seen anyone enable in production, and Postgres has per-value compression which is good for extremely long strings, but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URL's, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).

by cryptonector

3 subcomments

^F intern -> not found.
But string interning is what they're doing, isn't it?
> Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
That's string interning!!
Is interning just too old a concept now and it has to be rediscovered/reinvented and renamed?

by ayuhito

0 subcomment

DuckDB has one of my favourite articles on this topic if you want something a little more high level: https://duckdb.org/2022/10/28/lightweight-compression

by solatic

0 subcomment

Really don't understand.
Their own cherry-picked example benchmarks show that there is frequently no benefit, after enabling FSST, to query runtimes and sometimes query runtimes are actually negatively affected. Despite hard empirical evidence against FSST, they decided to enable it across-the-board? Not even "hey, we built this cool thing which may or may not improve your workloads, here's how to opt-in", but pushing straight up performance regressions for their customers with no way of opting-out?
Why would I ever trust my data to engineers who are more interested in their academically-interesting resume-driven-development than in actual customer outcomes?

by cespare

0 subcomment

This scheme reminds me of smaz (https://github.com/antirez/smaz) but with dynamically generated dictionaries.

by lor_louis

0 subcomment

I've implemented a similar system based on the original 2020 paper, but we applied it to the text log to try to "extract" similar features from free-form text. It looked promising and even supported full regex search, but the work was ultimately abandoned when we got acquired.

by stephantul

0 subcomment

The compression algorithm is very similar to a greedy subword tokenizer, which is used in BERT and other older language models, but has become less popular in favor of BPE.

by mbfg

2 subcomments

by vismit2000

0 subcomment

by vivzkestrel

1 subcomments

pardon my stupid question, what does cedardb do that postgres, duckdb, cassandra and the 758452758421 databases that came before dont?

by ForHackernews

2 subcomments

Never heard of CedarDB.
Seems to be another commercial cloud-hosted thing offering a Postgres API? https://dbdb.io/db/cedardb
https://cedardb.com/blog/ode_to_postgres/

by semiinfinitely

1 subcomments