> The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data (“chunks”) and metadata (a map of where the “chunks” are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage.
Unfortunately, the benchmarks use Redis. Why would I build distributed storage on top of a system like S3, which is all about consistency/durability/availability guarantees, only to put my metadata in Redis?
It would be nice to see benchmarks with another metadata store.
When we tried it at Krea we ended up moving on because we couldn't get sufficient performance to train on, and having to choose which datacenter to deploy our metadata store in essentially forced us to use it in only one location at a time.
Over the decades I have written test harnesses for many distributed filesystems, and the only one that seemed to actually offer POSIX semantics was LustreFS, which, for related reasons, is also an operability nightmare.
I also like that it can keep a local read cache, rather than having to hit object storage for every read. This is because it can perform a freshness check with the (relatively fast) metadata store to determine whether its cached data is valid before serving the request from cache.
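For intuition, here is a minimal sketch of that read path (my own illustration, not JuiceFS code; `MetadataStore`, `local_cache`, and `object_store` are made-up names): the client only trusts its local cache after the metadata store confirms the cached version is still current.

```python
# Hypothetical validate-before-serve read cache; names are illustrative, not JuiceFS APIs.
def read_chunk(chunk_id, metadata_store, local_cache, object_store):
    current_version = metadata_store.get_version(chunk_id)  # fast check against the metadata store
    cached = local_cache.get(chunk_id)
    if cached is not None and cached.version == current_version:
        return cached.data                                   # serve from cache, no object-store round trip
    data = object_store.get(chunk_id)                        # slow path: fetch from object storage
    local_cache.put(chunk_id, data, current_version)
    return data
```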
We back it with a 3-node (Redis-compatible) HA Valkey cluster and in-cluster MinIO object storage, all on bare-metal Kubernetes. We can saturate a 25 Gbit NIC with (IIRC) 16+ concurrent users.
It is also one of the few Kubernetes storage providers that offers read-write-many (RWX) access, which can be rather helpful in some situations.
In an early test we ran it against MinIO with zero redundancy, which is not recommended in any case. There we did see some file corruption creep in: some files in JuiceFS became unreadable, but the system as a whole kept working.
Another reason I think JuiceFS works well is its custom block-based storage format. It is disconcerting that you cannot see your files in object storage, just a lot of chunks, but this buys some real performance benefits, especially when doing partial file reads or updates.
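To make that concrete, here is a rough sketch of why chunking helps partial reads (the 4 MiB chunk size and the helper are made up for illustration, not JuiceFS's actual layout): a read at a given offset only has to fetch the chunks covering that byte range, not the whole object.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size, not JuiceFS's real layout

def chunks_for_range(offset, length, chunk_size=CHUNK_SIZE):
    """Return the chunk indices needed to serve a partial read."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return list(range(first, last + 1))

# Reading 1 KiB at offset 100 MiB touches a single 4 MiB chunk,
# not the entire (possibly multi-GiB) object.
print(chunks_for_range(100 * 1024 * 1024, 1024))  # -> [25]
```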
Another test we're doing is running a small-to-medium sized Prometheus persisted to JuiceFS. It hasn't shown any issues so far.
And, if you've made it this far: check us out if you want a hand installing and operating this kind of infra: https://lithus.eu . We deploy to bare-metal Hetzner.
* JuiceFS - Works well, but for high performance it has limited use cases where privacy concerns matter; there is the open source version, which is slower. The metadata backend selection really matters if you are tuning for latency.
* Lustre - Heavily optimised for latency. Gets very expensive if you need more bandwidth, as it is tiered and tied to volume sizes. Managed solutions available pretty much everywhere.
* EFS - Surprisingly good these days, but still insanely expensive. Useful for small amounts of data (a few terabytes).
* FlexFS - An interesting beast. It murders on bandwidth/cost but slightly loses on latency-sensitive operations. Great if you have petabyte-scale data and need to parallel-process it, but it struggles when you have tooling that does many small unbuffered writes.
Although the maintainers of these projects disagree, I mostly consider them a workaround for smaller projects. For big data (PB range) and critical production workloads I recommend biting the bullet and making your software natively S3-compatible, without going through a POSIX-mounted S3 proxy.
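As an example of what "natively S3-compatible" buys you, a ranged GET with boto3 reads just the bytes you need with no filesystem layer in between (the bucket and key names below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Fetch only bytes 0-1023 of an object; no POSIX mount, no FUSE overhead.
resp = s3.get_object(
    Bucket="my-data-bucket",              # placeholder bucket
    Key="datasets/part-0001.parquet",     # placeholder key
    Range="bytes=0-1023",
)
header = resp["Body"].read()
```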
When we actually need to manipulate or generate something in Python, we download from/upload to S3 and wrap it all in a tempfile.TemporaryDirectory() to clean up the local disk when we're done. If you don't do this, you eventually end up with a bunch of garbage in /tmp/ that you need to deal with.
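The pattern looks roughly like this (bucket, keys, and the process() step are placeholders, not the commenter's actual code); the temporary directory and everything in it are removed when the `with` block exits, so nothing accumulates in /tmp/:

```python
import os
import tempfile

import boto3

s3 = boto3.client("s3")

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "input.bin")
    dst = os.path.join(workdir, "output.bin")

    s3.download_file("my-bucket", "jobs/123/input.bin", src)   # placeholder bucket/key
    process(src, dst)                                           # placeholder for the local work
    s3.upload_file(dst, "my-bucket", "jobs/123/output.bin")
# workdir is deleted here, along with src and dst
```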
We also have some longer-lived disk caches, and using the data in the db plus an os.stat() on the file we can easily tell whether the cache is up to date without hitting S3. And for this cache we can just delete entries that are old according to os.stat() to manage its size, since we can always fetch the data from S3 again if needed in the future.
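A rough sketch of that check (the expected size/mtime fields are assumptions about what such a db record might hold): compare what the db says about the object with what os.stat() reports for the cached file, and prune by age when the cache grows.

```python
import os
import time

def cache_is_fresh(path, expected_size, expected_mtime):
    """Compare an os.stat() of the cached file against what the db recorded."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size == expected_size and st.st_mtime >= expected_mtime

def prune_cache(cache_dir, max_age_seconds):
    """Delete cached files that are old; S3 still has the originals if we need them."""
    now = time.time()
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if now - os.stat(path).st_mtime > max_age_seconds:
            os.remove(path)
```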
* poor locking support (this sounds like it works better)
* it's slow
* no manual fence support; a bad but common way of distributing workloads is e.g. to compile a test on one machine (on an NFS mount) and then use SLURM or SGE to run the test on other machines. You use NFS to let the other machines access the data... and this works... except that you either have to disable write caches or resort to horrible hacks to make the output of the first machine visible to the others. What you really want is a manual fence: "make all changes to this directory visible on the server"
* The bloody .nfs000000 files. I think this might be fixed by NFSv4 but it seems like nobody actually uses that. (Not helped by the fact that CentOS 7 is considered "modern" to EDA people.)
> * Close-to-open consistency. Once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client. Within the same mount point, all the written data can be read immediately.
> Rename and all other metadata operations are atomic, which are guaranteed by supported metadata engine transaction.
This is a lot more than other "POSIX compatible" overlays claim, and I think similar to what NFSv4 promises. There are lots of subtleties there, though, and I doubt you could safely run a database on it.
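For what close-to-open consistency actually promises, a small illustration (the mount path is hypothetical, and this assumes a filesystem with the guarantee quoted above): the read is only guaranteed to see the data because the writer closed the file before the reader opened it.

```python
# Client A (one mount of the shared filesystem):
with open("/mnt/jfs/results.txt", "w") as f:   # hypothetical mount path
    f.write("done\n")
# close() has returned, so the data must be visible to subsequent opens.

# Client B (a different mount, possibly another machine), running afterwards:
with open("/mnt/jfs/results.txt") as f:
    assert f.read() == "done\n"   # guaranteed only because A closed before B opened
```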
We need a kernel-native distributed file system so that we can build distributed storage/databases on top of it.
This is like building an operating system on top of a browser.
I'm not an enterprise-storage guy (just sqlite on a local volume for me so far!) so those really helped de-abstractify what JuiceFS is for.