> The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data (“chunks”) and metadata (a map of where the “chunks” are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage.
Unfortunately, the benchmarks use Redis. Why would I build distributed storage on top of a system like S3, which is all about consistency/durability/availability guarantees, only to put my metadata in Redis?
It would be nice to see benchmarks with another metadata store.
When we tried it at Krea we ended up moving on because we couldn't get sufficient performance to train on, and having to choose which datacenter to deploy our metadata store in essentially forced us to use it in only one location at a time.
Over the decades I have written test harnesses for many distributed filesystems, and the only one that seemed to actually offer POSIX semantics was LustreFS, which, for related reasons, is also an operability nightmare.
I also like that it can keep a local read cache, rather than having to hit object storage for every read. This is because it can perform a freshness check with the (relatively fast) metadata store to determine whether its cached data is valid before serving the request from cache.
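For intuition, here is a minimal sketch of that read path (my own illustration, not JuiceFS code; `MetadataStore`, `local_cache`, and `object_store` are made-up names): the client only trusts its local cache after the metadata store confirms the cached version is still current.

```python
# Hypothetical validate-before-serve read cache; names are illustrative, not JuiceFS APIs.
def read_chunk(chunk_id, metadata_store, local_cache, object_store):
    current_version = metadata_store.get_version(chunk_id)  # fast check against the metadata store
    cached = local_cache.get(chunk_id)
    if cached is not None and cached.version == current_version:
        return cached.data                                   # serve from cache, no object-store round trip
    data = object_store.get(chunk_id)                        # slow path: fetch from object storage
    local_cache.put(chunk_id, data, current_version)
    return data
```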
We back it with a 3-node (Redis-compatible) HA Valkey cluster and in-cluster MinIO object storage, all on bare-metal Kubernetes. We can saturate a 25 Gbit NIC with (IIRC) 16+ concurrent users.
It is also one of the few Kubernetes storage providers that offers read-write-many (RWX) access, which can be rather helpful in some situations.
In an early test we ran it against MinIO with zero redundancy, which is not recommended in any case. There we did see some file corruption creep in: some files in JuiceFS became unreadable, but the system as a whole kept working.
Another reason I think JuiceFS works well is its custom block-based storage format. It is disconcerting that you cannot see your files in object storage, just a lot of chunks, but this buys some real performance benefits, especially when doing partial file reads or updates.
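To make that concrete, here is a rough sketch of why chunking helps partial reads (the 4 MiB chunk size and the helper are made up for illustration, not JuiceFS's actual layout): a read at a given offset only has to fetch the chunks covering that byte range, not the whole object.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size, not JuiceFS's real layout

def chunks_for_range(offset, length, chunk_size=CHUNK_SIZE):
    """Return the chunk indices needed to serve a partial read."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return list(range(first, last + 1))

# Reading 1 KiB at offset 100 MiB touches a single 4 MiB chunk,
# not the entire (possibly multi-GiB) object.
print(chunks_for_range(100 * 1024 * 1024, 1024))  # -> [25]
```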
Another test we're doing is running a small-to-medium sized Prometheus persisted to JuiceFS. It hasn't shown any issues so far.
And, if you've made it this far: check us out if you want a hand installing and operating this kind of infra: https://lithus.eu . We deploy to bare-metal Hetzner.
* JuiceFS - Works well, but for high performance it has limited use cases where privacy concerns matter; there is the open source version, which is slower. The metadata backend selection really matters if you are tuning for latency.
* Lustre - Heavily optimised for latency. Gets very expensive if you need more bandwidth, as it is tiered and tied to volume sizes. Managed solutions available pretty much everywhere.
* EFS - Surprisingly good these days, but still insanely expensive. Useful for small amounts of data (a few terabytes).
* FlexFS - An interesting beast. It murders on bandwidth/cost but slightly loses on latency-sensitive operations. Great if you have petabyte-scale data and need to parallel-process it, but it struggles when you have tooling that does many small unbuffered writes.
Although the maintainers of these projects disagree, I mostly consider them a workaround for smaller projects. For big data (PB range) and critical production workloads I recommend biting the bullet and making your software natively S3-compatible, without going through a POSIX-mounted S3 proxy.
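As an example of what "natively S3-compatible" buys you, a ranged GET with boto3 reads just the bytes you need with no filesystem layer in between (the bucket and key names below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Fetch only bytes 0-1023 of an object; no POSIX mount, no FUSE overhead.
resp = s3.get_object(
    Bucket="my-data-bucket",              # placeholder bucket
    Key="datasets/part-0001.parquet",     # placeholder key
    Range="bytes=0-1023",
)
header = resp["Body"].read()
```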
When we actually need to manipulate or generate something in Python, we download from/upload to S3 and wrap it all in a tempfile.TemporaryDirectory() to clean up the local disk when we're done. If you don't do this, you eventually end up with a bunch of garbage in /tmp/ that you need to deal with.
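The pattern looks roughly like this (bucket, keys, and the process() step are placeholders, not the commenter's actual code); the temporary directory and everything in it are removed when the `with` block exits, so nothing accumulates in /tmp/:

```python
import os
import tempfile

import boto3

s3 = boto3.client("s3")

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "input.bin")
    dst = os.path.join(workdir, "output.bin")

    s3.download_file("my-bucket", "jobs/123/input.bin", src)   # placeholder bucket/key
    process(src, dst)                                           # placeholder for the local work
    s3.upload_file(dst, "my-bucket", "jobs/123/output.bin")
# workdir is deleted here, along with src and dst
```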
We also have some longer-lived disk caches, and using the data in the db plus an os.stat() on the file we can easily tell whether the cache is up to date without hitting S3. And for this cache we can just delete entries that are old according to os.stat() to manage its size, since we can always fetch the data from S3 again if needed in the future.
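A rough sketch of that check (the expected size/mtime fields are assumptions about what such a db record might hold): compare what the db says about the object with what os.stat() reports for the cached file, and prune by age when the cache grows.

```python
import os
import time

def cache_is_fresh(path, expected_size, expected_mtime):
    """Compare an os.stat() of the cached file against what the db recorded."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size == expected_size and st.st_mtime >= expected_mtime

def prune_cache(cache_dir, max_age_seconds):
    """Delete cached files that are old; S3 still has the originals if we need them."""
    now = time.time()
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if now - os.stat(path).st_mtime > max_age_seconds:
            os.remove(path)
```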
* poor locking support (this sounds like it works better)
* it's slow
* no manual fence support; a bad but common way of distributing workloads is e.g. to compile a test on one machine (on an NFS mount) and then use SLURM or SGE to run the test on other machines. You use NFS to let the other machines access the data... and this works... except that you either have to disable write caches or resort to horrible hacks to make the output of the first machine visible to the others. What you really want is a manual fence: "make all changes to this directory visible on the server"
* The bloody .nfs000000 files. I think this might be fixed by NFSv4 but it seems like nobody actually uses that. (Not helped by the fact that CentOS 7 is considered "modern" to EDA people.)
> * Close-to-open consistency. Once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client. Within the same mount point, all the written data can be read immediately.
> Rename and all other metadata operations are atomic, which are guaranteed by supported metadata engine transaction.
This is a lot more than other "POSIX compatible" overlays claim, and I think similar to what NFSv4 promises. There are lots of subtleties there, though, and I doubt you could safely run a database on it.
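For what close-to-open consistency actually promises, a small illustration (the mount path is hypothetical, and this assumes a filesystem with the guarantee quoted above): the read is only guaranteed to see the data because the writer closed the file before the reader opened it.

```python
# Client A (one mount of the shared filesystem):
with open("/mnt/jfs/results.txt", "w") as f:   # hypothetical mount path
    f.write("done\n")
# close() has returned, so the data must be visible to subsequent opens.

# Client B (a different mount, possibly another machine), running afterwards:
with open("/mnt/jfs/results.txt") as f:
    assert f.read() == "done\n"   # guaranteed only because A closed before B opened
```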
We need a kernel-native distributed file system so that we can build distributed storage/databases on top of it.
This is like building an operating system on top of a browser.
I'm not an enterprise-storage guy (just sqlite on a local volume for me so far!) so those really helped de-abstractify what JuiceFS is for.