Hacker News

447 points by tosh

by vouwfietsman

2 subcomments

Not sure why this got so many upvotes. Also, the landing page is not great; it's better to look at the paper (see link below).
Seems to be a columnar storage format that addresses some shortcomings in parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses.
Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used parquet version is the oldest version from 2013 (as per the paper itself), so parquet itself couldn't even supplant parquet. If you want to improve on it, you need to bring some serious results, which I don't think f3 does.
Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.
Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires flatbuffers to parse that blob? Kind of defeats the purpose.
Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.
Maybe I'm being too cynical. Can someone help me out here?
https://dl.acm.org/doi/epdf/10.1145/3749163

by gavinray

8 subcomments

This bit is quite genius: rather than depend on a language-specific SDK/lib for working with the format, you can fall back to exported WASM methods if none exists:

  > "Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable. "
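
A minimal sketch of that fallback in Python. Every name here (`NATIVE_DECODERS`, `run_embedded_wasm`, the encoding ids) is hypothetical and not F3's actual API, and the Wasm call is a stand-in so the example runs without a runtime installed:

```python
from typing import Callable, Dict

# Hypothetical registry of decoders built into the reader, keyed by encoding id.
NATIVE_DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "plain": lambda payload: payload,
}


def run_embedded_wasm(wasm_binary: bytes, payload: bytes) -> bytes:
    """Stand-in for executing the decoder shipped inside the file.

    A real reader would instantiate the module in a sandboxed Wasm runtime.
    This placeholder just reverses the bytes so the sketch has no dependencies.
    """
    if not wasm_binary:
        raise ValueError("file carries no embedded decoder")
    return payload[::-1]


def decode(encoding_id: str, payload: bytes, wasm_binary: bytes) -> bytes:
    """Prefer a native decoder; fall back to the embedded one if none exists."""
    native = NATIVE_DECODERS.get(encoding_id)
    if native is not None:
        return native(payload)
    return run_embedded_wasm(wasm_binary, payload)
```

The point of the design is that a reader which has never heard of an encoding still takes the second branch instead of failing.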

by sph

1 subcomments

I don’t know what people are commenting on. I see a README with little to no information about what this is or what problems it solves, just links to its Flatbuffer description and a directory full of source code.
What context am I missing?

by largbae

4 subcomments

This could use a bit more "why".
Shortcomings of Parquet are mentioned as overcome by this, but which ones? Certainly not wide tool support...
Why should one leave Parquet or ORC for this structure?

by zerobees

1 subcomments

Some folks described it as genius. I guess it's my turn to play the role of an annoying HN skeptic: I find it somewhat silly. Data compression formats are secondary to what you're planning to do with the data once decoded. An audio file is completely different than an SVG image. An embedded VM that decompresses video to raw pixels doesn't magically let you play that video in a text editor, so there's no radically new kind of interoperability. Each new format still needs to be handled in a format-specific way.
I guess one use case is that I come up with a video compression scheme that's better than H.265, but not all platforms support it, so I embed a decoder that would allow me to play it back on legacy hardware. But that also shows the weakness of the idea: it's unlikely that legacy hardware will perform well doing software-only decode for video formats from the future. If we rolled this idea out in the 1990s, it would not have allowed watching Netflix on an i386.
In the same vein, I doubt this would have allowed me to open Word 2021 files in Word 97. There's no 1-to-1 mapping between the data structures. So if this kind of compat isn't slam-dunk, what's the goal?
The downsides are clear. First, it's probably a maintenance nightmare: if your decoder has a bug that needs fixing, how do you patch all the files that already embed it? And then, there's size overhead and security risks. We're adding a considerable attack surface to every format parser. It's more opportunities for remote code execution, resource exhaustion attacks, and so on. Again, this is not always wrong, but what's the benefit?

by amluto

1 subcomments

One nice thing about some modern formats is that there are tools that read them at extraordinarily high effective speed. For example, DuckDB can do all manner of nifty optimizations while reading its own native format or Parquet. And I’m not sure that those optimizations can be effectively applied to a format that needs a WASM blob to be run to understand it. By the time you run a non-SIMD or even a SIMD-optimized pass over all the data, if that pass doesn’t understand your query, you may have already lost.
I admit I only skimmed the beginning of the paper, and maybe the format is less general than it sounds.

by Groxx

0 subcomment

Hm. I can kinda see it replacing self-extracting EXEs, but a lot of why you choose specific file formats is for specific features they offer - any self-describing system can fall into "there are too many competing features and nobody handles them all" exactly as easily as any other format.
Like, can this file be efficiently mmap'd? Maybe if it emulates tar internally, but you don't know until you run it. Can it be seeked to specific bytes to only decompress part? It only supports a pre-release version of ISO-36898533 seeking, and your file library dropped support for it 6 years ago. If I rewrite 1MB in the middle, can it only change those pages on disk (and maybe an index), or do I have to rewrite the whole thing? Well the wasm blob supports 97 different APIs for it (there are 35 copies of one with different names), so it's larger than the data (but nobody paid attention to that), so you have 19 options that you recognize, but your CPU's native WASM accelerator only handles two or three so you've still got to specialize your code heavily.
At least with "*.tar.gz" you have some idea of what's possible.

by owentbrown

0 subcomment

Nice! The world can always use a better data format.
I think you might get some traction if you post the advantages over parquet and other formats directly on the readme, so that if someone goes to https://github.com/future-file-format/f3 they see why they should try it.
Mention the advantages and post metrics. Cherry pick the metrics! There's probably a good use case for this but, from the current readme, it's not clear who should use this and why.

by anygivnthursday

1 subcomments

My concern is, if decode fails I need to debug WASM added by some other party maybe containing random bugs. Maybe a library of standard decoders maintained and tested by the project could help, but then not sure if it kills the advantage of the flexibility it provides.

by coffeecoders

2 subcomments

If I am archiving PBs of data for 10+ years, I don't want to rely on a WASM interpreter being available and performant in the future just to read a file. I want a dead-simple, heavily documented byte specification like Parquet.
Additionally, putting the decoding logic inside a WASM binary introduces an active execution layer into what should be cold storage.

by Qerub

0 subcomment

This reminds me of Alan Kay's OOPSLA 1997 presentation "The Computer Revolution Hasn’t Happened Yet" when he describes the Air Force / Burroughs 220 file format from 1961 where the file/tape contained both the data and the procedures to read/write/print them: https://youtu.be/oKg1hTOQXoY?t=1355

by dang

0 subcomment

One past discussion:
F3: Open-source data file format for the future [pdf] - https://news.ycombinator.com/item?id=45437759 - Oct 2025 (125 comments)
plus this bit:
An Open File Format for storing the information from a forge - https://news.ycombinator.com/item?id=44043253 - May 2025 (1 comment)

by krzyk

2 subcomments

File format for what? Text, graphics, compiled code?

by thisisauserid

0 subcomment

Great! I'll use it.
In the "future."
Nimble? Lance? Also in the future. Maybe.
I'll use Parquet in the present.

by nine_k

1 subcomments

F3 seems to be a reasonable archival data format.
I see many replies criticizing F3 as an operational data format, like Parquet. Of course it can't be made as fast in the general case, or as compatible with the existing infrastructure.
OTOH F3 would be easy to decode into almost any of today's accepted formats, and likely to any of tomorrow's data formats. That's where being self-describing and self-unpacking would be important.

by drdexebtjl

0 subcomment

Probably not a good idea to name your project “future” anything, if you expect that future to become the present.
Also, f3 is already “fight-flash-fraud”.

by chatmasta

0 subcomment

As appealing as this is, it will never gain traction without some backwards compatibility with Parquet and wide adoption of query engines to implement that backwards compatible path.

by Arainach

1 subcomments

This project README is not particularly useful:
It doesn't explain what the project does (a file format for what? Name dropping other things I haven't heard of isn't useful)
There are no examples. It links to a flatbuffer schema which is at least well commented, but is full of deep implementation details.
The point is that after 2-3 minutes I'm not convinced why I should care, and I still don't know enough about what this is to even think back to it if I encounter a scenario in the future where it would be useful.
> designed with efficiency, interoperability, and extensibility in mind. It provides a data organization that rectifies the layout shortcomings of the last-generation formats like Parquet,
This is all marketing speak that says nothing.
> maintaining good interoperability and extensibility (a.k.a future-proof) via embedded Wasm decoders
What does this even mean? Providing a decoder is no guarantee of futureproofness.

by mmaunder

0 subcomment

A Wasm decoder takes encoded bytes and returns an iterator of Arrow Buffers. In case you were wondering.
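
To make that shape concrete, here is a toy sketch in Python: bytes in, an iterator of buffers out. It is only an illustration under stated assumptions. The run-length encoding, the `toy_decoder` name, and the plain `bytes` buffers (standing in for Arrow Buffers) are all made up and are not F3's actual interface.

```python
from typing import Iterator


def toy_decoder(encoded: bytes, buffer_size: int = 4) -> Iterator[bytes]:
    """Toy run-length decoder: input is a sequence of (count, value) byte pairs.

    Yields the decoded output lazily, in buffers of at most buffer_size bytes.
    A trailing unpaired byte is ignored.
    """
    pending = bytearray()
    for i in range(0, len(encoded) - 1, 2):
        count, value = encoded[i], encoded[i + 1]
        pending.extend(bytes([value]) * count)
        # Emit full buffers as soon as enough decoded bytes are available.
        while len(pending) >= buffer_size:
            yield bytes(pending[:buffer_size])
            del pending[:buffer_size]
    if pending:
        yield bytes(pending)  # final, possibly short, buffer
```

Because the result is an iterator, the caller can stop early or process one buffer at a time without decoding the whole column up front.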

by mmaunder

0 subcomment

Not quite vaporware, but few commits, PRs, history, actual examples etc. It's pretty thin.

by gruntled-worker

1 subcomments

Are we positively sure that WASM will prove to be more future-proof than 640K MS-DOS or WinXP, or SNES cartridge files for that matter? On 6/23/26 there are a lot of emulators that run these. Will WASM necessarily beat them on 6/23/2051? Might be a case of xkcd 927.

by meta-level

1 subcomments

Don't know why but I had to think of https://xkcd.com/2116/

by adammarples

1 subcomments

No commits in 8 months?

by lowbloodsugar

0 subcomment

>via embedded Wasm decoders
runs screaming

by ShinyLeftPad

0 subcomment

To save a click it's a file format for columnar data specifically (like Parquet), which they very generically named Future-proof File Format. Most of this could fit in the title instead of just "F3"

by GolDDranks

0 subcomment

I love the idea, and I developed something similar myself in the past (https://github.com/golddranks/kobuta), but... this reeks of slop. With Rust code, edition="2021" is a dead giveaway.

by jauntywundrkind

0 subcomment

The wasm decoder thing was also done in Anyblox. https://github.com/AnyBlox https://gienieczko.com/anyblox-paper
Has nimble/velox had any better luck lately? I forget what stories someone shared, but, it seemed to have such big intent, then real trouble actually getting released. I want to say someone was saying the lawyers ended up not letting a lot of the work get released. Nimble is the one competitor benchmarked against here that beats them, and is also extensible (to some degree?), so I'd love to know how things have gone for the past 6-12 months for nimble/velox. https://news.ycombinator.com/item?id=39995112 https://github.com/facebookincubator/nimble/ https://materializedview.io/p/nimble-and-lance-parquet-kille...

by antisthenes

0 subcomment

The description mentions shortcomings of the previous file types like parquet, but it isn't really evident to me what those shortcomings are, or if the use cases for parquet and F3 have really that much of an overlap to make this comparison valid in the first place.

by MoonWalk

0 subcomment

Is what?

by corvad

0 subcomment

https://xkcd.com/927/

by ChrisArchitect

0 subcomment

A more descriptive title would be helpful OP:
F3: Open-source data file format for the future
Previous discussion:
2025 https://news.ycombinator.com/item?id=45437759