FRESH

Hacker News

Home

People Keep Inventing Prolly Trees

182 points by lifty

by wmanley

0 subcomment

Here's apenwarr's description of the same data structure from 2009: https://apenwarr.ca/log/20091004 .
Here's a post to the git mailing list from Martin Uecker describing the same from 2005: https://lore.kernel.org/git/20050416173702.GA12605@macavity/ . From the tone of the email it sounds like he didn't consider the idea new at that point:
> The chunk boundaries should be determined deterministically from local properties of the data. Use a rolling checksum over some small window and split the file it it hits a special value (0). This is what the rsyncable patch to zlib does.
He calls it a merkle hash tree.
Edit: here's one that's one day earlier from C. Scott Ananian: https://lore.kernel.org/git/Pine.LNX.4.61.0504151232160.2763...
> We already have the rsync algorithm which can scan through a file and efficiently tell which existing chunks match (portions of) it, using a rolling checksum. (Here's a refresher: http://samba.anu.edu.au/rsync/tech_report/node2.html ). Why not treat the 'chunk' as the fundamental unit, and compose files from chunks?

by gritzko

1 subcomments

Prolly Trees are Merkle-fied B-trees, essentially.
I am working on related things[r], using Merkle-fied LSM trees. Ink&Switch do things that very closely resemble Merklefied LSM[h], although they are not exactly LSM. I would not be surprised if someone else is doing something similar in parallel. The tricks are very similar to Prollies, but LSM instead of B-trees.
That reminds me my younger years when I "invented" the Causal Tree[c] data structure. It was later reinvented as RGA (Replicated Growable Array [a]), Timestamped Insertion Tree and, I believe, YATA. All seem to be variations of a very very old revision control data structure named "weave"[w].
Recently I improved CT to the degree that warranted a new algorithm name (DISCONT [d]). Fundamentally the same, but much cheaper. Probably, we should see all these "inventions" as improvements. All the Computer Science basics seem to have been invented in the 70s, 80s the latest.
[w]: https://docs.rs/weave/latest/weave/
[r]: https://github.com/gritzko/librdx
[d]: https://github.com/gritzko/go-rdx/blob/main/DISCOUNT.md
[h]: https://www.inkandswitch.com/keyhive/notebook/05/
[c]: https://dl.acm.org/doi/10.1145/1832772.1832777
[a]: https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf links to the authors of RGA

by compressedgas

2 subcomments

This article does not mention Jumbostore (Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes) which used content defined chunking recursively on the chunk list of a content defined chunked file in 2007. This is exactly what a Prolly Tree is.

by ChadNauseam

9 subcomments

Haha, this is funny. I've been obsessed with rolling-hash based chunking since I read about it in the dat paper. I didn't realize there was a tree version, but it is a natural extension.
I have a related cryptosystem that I came up with, but is so obvious I'm sure someone else has invented it first. The idea is to back up a file like so: first, do a rolling-hash based chunking, then encrypt each chunk where the key is the hash of that chunk. Then, upload the chunks to the server, along with a file (encrypted by your personal key) that contains the information needed to decrypt each chunk and reassemble them. If multiple users used this strategy, any files they have in common would result in the same chunks being uploaded. This would let the server provider deduplicate those files (saving space), without giving the server provider the ability to read the files. (Unless they already know exactly which file they're looking for, and just want to test whether you're storing it.)
Tangent: why is it that downloading a large file is such a bad experience on the internet? If you lose internet halfway through, the connection is closed and you're just screwed. I don't think it should be a requirement, but it would be nice if there was some protocol understood by browsers and web servers that would be able to break-up and re-assemble a download request into a prolly tree, so I could pick up downloading where I left off, or only download what changed since the last time I downloaded something.

by inetknght

0 subcomment

> Sometimes an invention is not widely known because its creator doesn't realize they've created something novel: the design seemed obvious and intuitive to them.
I never went to high school or college or anything.
I can't tell you how many times I come up with something, only to discover years later that someone else came up with the same idea later (or sometimes earlier), branded it, and marketed it.

by elric

2 subcomments

How do people find specialised data structure that they aren't already aware of? Stumbling across random blog posts and reading the odd book on data structures can't be the optimal way.
Is there a way to search for a structure by properties? E.g. O(1) lookups, O(log(n)) inserts or better, navigates like a tree (just making this up), etc?

by jerf

1 subcomments

I kind of like data structures, because when you study them, and take them apart, and understand their pieces, rather than seeing the world as Prolly Trees here and Binary Trees there and Bloom Filters over there, you see a whole bunch of little tricks you can use, and when you collect a reasonably large bag of those tricks you can put them together in all sorts of ways.
It's almost a pity computers are as fast as they are and they are so rarely needed because having "arrays" and "maps/dicts/associative arrays/whatever" solves so many problems so much faster than we need anyhow. I don't get to pull out the bag of tricks very often. But then again, when I do, it's because it's a life saver and the difference between success and failure, so maybe it all balances out.

0 subcomment

by donatj

0 subcomment

It would probably help if they had a Wikipedia page. Someone who actually understands what they are should get on that.
Googling "Prolly Trees", there's not much and this article is one of the top results.

by iamwil

1 subcomments

Anyone know if editing a prolly tree requires reconstructing the entire tree from the leaves again? All the examples I've ever seen in a wild reconstruct from the bottom up. Presumably, you can leave the untouched leaves intact, and the reconstruct parent nodes whose hashes have changed due to the changed leaves. I ended up doing an implementation of this, and wondered if it's of any interest or value to others?

0 subcomment

by guywithahat

0 subcomment

One things an exec said at my old job I liked was “research is generally ~5-10 years behind industry”, and I see that seems to still be the case for prolly trees

by zombot

2 subcomments

What the hell does "probabilistically balanced" mean?