Just with those two criteria you’re down to, like, six formats at most, of which Protocol Buffers is the most widely used.
And I know the article says no one uses the backwards-compatible stuff, but that's bizarre to me – with N clients and a server communicating over Protocol Buffers, being able to add fields to the schema and then deploy the servers and clients in any order is way nicer than formats that force you to babysit deployment order.
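To make the deployment-order point concrete, here's a toy sketch (not the real protobuf library) of why it works: the wire format tags every field, so a reader built against an older schema can skip – or carry along – fields it doesn't recognize:

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next_index)."""
    result = shift = 0
    while True:
        b = buf[i]; i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def decode_fields(buf):
    """Decode (field_number, wire_type, value) triples, keeping unknown fields."""
    i, fields = 0, []
    while i < len(buf):
        key, i = read_varint(buf, i)
        field_no, wire_type = key >> 3, key & 0x7
        if wire_type == 0:            # varint
            val, i = read_varint(buf, i)
        elif wire_type == 2:          # length-delimited (bytes, string, sub-message)
            length, i = read_varint(buf, i)
            val, i = buf[i:i + length], i + length
        else:
            raise NotImplementedError(wire_type)
        fields.append((field_no, wire_type, val))
    return fields

# A message from a *newer* schema: field 1 = 150, plus a field 2 the old
# reader has never heard of. It still parses, and field 2 can be preserved.
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
print(decode_fields(msg))   # → [(1, 0, 150), (2, 2, b'hi')]
```

Nothing about decoding requires the reader's schema to match the writer's, which is what lets you roll out servers and clients in any order.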
The reason why protos suck is because remote procedure calls suck, and protos expose that suckage instead of trying to hide it until you trip on it. I hope the people working on protos, and other alternatives, continue to improve them, but they’re not worse than not using them today.
Just FYI: an obligatory comment from the protobuf v2 designer.
Yeah, protobuf has lots of design mistakes, but this article is written by someone who does not understand the problem space. Most of the complexity of serialization comes from maintaining compatibility between implementations deployed at different points in time. This significantly limits the design space.
Chances are, the author literally used software that does it as he wrote these words. This feature is critical to how Chrome Sync works. You wouldn’t want to lose synced state if you use an older browser version on another device that doesn’t recognize the unknown fields and silently drops them. This is so important that at some point Chrome literally forked protobuf library so that unknown fields are preserved even if you are using protobuf lite mode.
The pattern seems to be that generalized, user-composable solutions are discouraged in favor of a myriad of special constructs that satisfy whatever concrete use cases seem relevant for the designers in the moment.
This works for a while and reduces the complexity of the language upfront, while delivering results - but over time, the designs devolve into a rat's nest of hyperspecific features with awkward and unintuitive restrictions.
Eventually, the designers might give up and add more general constructs to the language - but those feel tacked on and have to coexist with specific features that can't be removed anymore.
True story: trying to reverse engineer macOS Photos.app sqlite database format to extract human-readable location data from an image.
I eventually figured it out, but it was:
A base64-encoded binary plist with one field containing a protobuf, which contained another protobuf, which contained a Unicode string with improperly encoded data (for example, U+2013 EN DASH appeared as the escaped bytes \342\200\223)
This could have been a simple JSON string.
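For the curious, a rough sketch of what unwrapping those layers looks like, minus the nested-protobuf step. The payload and key names here are made up; the real Photos.app schema differs:

```python
import base64
import plistlib

# Stand-in for what the database stores: a binary plist whose payload field
# holds an inner blob (in the real database, a nested protobuf), with the
# whole thing base64-encoded. "payload" is a hypothetical key name.
inner = "Tucson\u2013Phoenix".encode("utf-8")   # U+2013 EN DASH → bytes \342\200\223
blob = base64.b64encode(plistlib.dumps({"payload": inner}, fmt=plistlib.FMT_BINARY))

# Unwrapping: base64 → binary plist → inner bytes → UTF-8 string.
plist = plistlib.loads(base64.b64decode(blob))
text = plist["payload"].decode("utf-8")
print(text)   # Tucson–Phoenix
```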
https://news.ycombinator.com/item?id=18188519 (299 comments)
https://news.ycombinator.com/item?id=21871514 (215 comments)
https://news.ycombinator.com/item?id=35281561 (59 comments)
At some stage with every ESP or Arduino project, I want to send and receive data, i.e. telemetry and control messages. A lot of people use ad-hoc protocols or HTTP/JSON, but I decided to try the nanopb library. I ended up with a relatively neat solution that just uses UDP packets. For my purposes a single packet has plenty of space, and I can easily extend this approach in the future. I know I'm not the first person to do this but I'll probably keep using protobufs until something better comes along, because the ecosystem exists and I can focus on the stuff I consider to be fun.
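The UDP part of that setup is tiny. A sketch in Python (the real project uses nanopb-generated C; the payload here is a hand-encoded stand-in for a serialized telemetry message) of one reading per datagram:

```python
import socket

# Hypothetical telemetry payload: in the real project this would be a
# nanopb-serialized protobuf. 0x08 0x2a is field 1 (varint) = 42 in
# protobuf wire format, used here as a placeholder.
payload = b"\x08\x2a"

recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))                 # OS picks a free port
recv.settimeout(2.0)
port = recv.getsockname()[1]

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(payload, ("127.0.0.1", port))   # one telemetry reading per datagram

data, addr = recv.recvfrom(1500)            # fits comfortably in one packet
print(data == payload)                      # True
```

One datagram per reading keeps framing trivial – no length prefixes or stream reassembly, which is most of the appeal on a microcontroller.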
This is a rage bait, not worth the read.
I almost burst out in laughter when the article argued that you should reuse types in preference to inlining definitions. If you've ever felt the pain of needing to split something up, you would not be so eager to reuse. In a codebase with a single process, it's pretty trivial to refactor to split things apart; you can make one CL and be done. In a system with persistence and distribution, it's a lot more awkward.
That whole meaning of data vs representation thing. There's fundamentally a truth in the correspondence. As a program evolves, its understanding of its domain increases, and the fidelity of its internal representations increases too, by becoming more specific, more differentiated, more nuanced. But the old data doesn't go away. You don't get to fill in detail for data that was gathered in older times. Sometimes, the referents don't even exist any more. Everything is optional; what was one field may become two fields in the future, with split responsibilities, increased fidelity to the domain.
The fact that the author is arguing for making all messages required means they don't understand the reasoning for why all fields are optional. Required fields break systems when there are proto mismatches (there are postmortems outlining this).
This sums up a lot of the issues I’ve seen with protobuf as well. It’s not an expressive enough language to be the core data model, yet people use it that way.
In general, if you don’t have extreme network needs, then protobuf seems to cause more harm than good. I’ve watched Go teams spend months of time implementing proto based systems with little to no gain over just REST.
Nearly every other complaint is solved by wrapping things in messages (sorry, product types). I don't get the enum limitation on map keys, though – that complaint is fair.
Protobuf eliminates truckloads of stupid serialization/deserialization code that, in my embedded world, almost always has to be hand-written otherwise. If there was a tool that automatically spat out matching C, Kotlin, and Swift parsers from CDDL, I'd certainly give it a shot.
> * Make all fields in a message required. This makes messages product types.
Meanwhile in the capnproto FAQ:
>How do I make a field “required”, like in Protocol Buffers?
>You don’t. You may find this surprising, but the “required” keyword in Protocol Buffers turned out to be a horrible mistake.
I recommend reading the rest of the FAQ [0], but if you are in a hurry: fixed-schema protocols like protobufs do not let you remove fields the way self-describing formats such as JSON do. Removing fields, or switching them from required to optional, is an ABI-breaking change. Nobody wants to update all servers and all clients simultaneously. At that point, you would be better off defining a new API endpoint and deprecating the old one.
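A toy sketch (field names hypothetical; this is a hand-rolled validator, not a real protobuf API) of why `required` turns field removal into an ABI break: a reader that enforces required fields rejects any message from a writer that stopped sending one, even if the reader never actually uses it:

```python
# Hypothetical strict reader: enforces a required-field list at parse time.
def parse_user(fields: dict, required=("id", "name")):
    missing = [f for f in required if f not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return fields

parse_user({"id": 1, "name": "a"})   # old writers and old readers agree

# A newer writer that dropped "name" now breaks every deployed old reader:
# parse_user({"id": 1}) raises ValueError, even for readers that never
# looked at "name" – so the field can't be removed until all readers update.
```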
The capnproto FAQ also brings up the fact that validation should be handled at the application level rather than the ABI level.
Most of the other issues in the article can be solved by wrapping things in more messages. Not great, not terrible.
As with the tightly-coupled issues with Go, I'll keep waiting for a better approach any decade now. In the meantime, both tools (for their glaring imperfections) work well enough, solve real business use cases, and have a massive ecosystem moat that makes them easy to work with.
So HN, what are the best alternatives available today and why?
I don't actually want to do this, because then you have N + 1 implementations of each data type, where N = number of programming languages touching the data, and + 1 for the proto implementation.
What I personally want to do is use a language-agnostic IDL to describe the types that my programs use. Within Google you can even do things like just store them in the database.
The practical alternative is to use JSON everywhere, possibly with some additional tooling to generate code from a JSON schema. JSON is IMO not as nice to work with. The fact that it's also slower probably doesn't matter to most codebases.
I haven't used these very seriously, but a problem I had a while back was that the wire format was not what the applications wanted to use, while a good application format was too space-inefficient for the wire.
As far as I could see there was no great way to do this. You could rewrite the wire<->app converter in every app; or have a converter program, in which case you essentially have two wire formats and need to put this extra program and data movement into workflows; or write a library and maintain bindings for all your languages.
Despite issues, protobufs solve real problems and (imo) bring more value than cost to a project. In particular, I'd much rather work with protobufs and their generated ser/de than untyped json
funnily enough, this line alone reveals the author to be an amateur in the problem space they are writing so confidently about.
I filed an issue requesting this and it was denied with an explanation:
https://github.com/protocolbuffers/protobuf/issues/7791#issu...
Recently, however, I had the displeasure of working with FlatBuffers. It's worse.
I do tend to agree that they are bad. I also agree that people put a little too much credence in "came from Google." I can't bring myself to have this much anger towards it. Had to have been something that sparked this.
https://protobuf.dev/design-decisions/nullable-getters-sette...
Protobuf as a language feels clunky. The “type before identifier” syntax looks ancient and Java-esque.
The tools are clunky too. protoc is full of gotchas, and for something as simple as validation, you need to add a zillion plugins and memorize their invocation flags.
From tooling to workflow to generated code, it’s full of Google-isms and can be awkward to use at times.
That said, the serialization format is solid, and the backward-compatibility paradigms are genuinely useful. Buf adds some niceties to the tooling and makes it more tolerable. There’s nothing else that solves all the problems Protobuf solves.
For messaging, JSON, used in the same way and with the same versioning practices as we have established for evolving schemas in REST APIs, has never failed me.
It seems to me that all these rigid type systems for remote procedure calls introduce more problems than they really solve and bring unnecessary complexity.
Sure, there are tradeoffs with flexible JSON – but its simplicity beats the potential advantages we get from systems like Avro or Protobuf.
I'm not very upset that protobuf evolved to be slightly more ergonomic. Bolting on features after you build the prototype is how you improve things.
Unfortunately, they really did design themselves into a corner (not unlike python 2). Again, I can't be too upset. They didn't have the benefit of hindsight or other high performance libraries that we have today.
Tag-length-value (TLV) encodings are just overly verbose for no good reason. They are _NOT_ "self-describing", and one does not need everything tagged to support extensibility. Even where one does need tags, tag assignments can be fully automatic and need not be exposed to the module designer. Anyone with a modicum of time spent researching how ASN.1 handles extensibility with non-TLV encoding rules knows these things. The entire arc of ASN.1's evolution over two plus decades was all about extensibility and non-TLV encoding rules!
And yes, ASN.1 started with the same premise as PB, but 40 years ago. Thus it's terribly egregious that PB's designers did not learn any lessons at all from ASN.1!
As near as I can tell, PB's designers thought they knew about encodings but didn't, and they refused to look at ASN.1 and such because of the lack of tooling for ASN.1 – but of course there was even less tooling for PB, since it didn't exist yet.
It's all exasperating.
This particular one provides the strongest backward-compatibility guarantees, with automatic conversion derivation where possible: https://github.com/7mind/baboon
Protobuf is dated, it's not that hard to make better things.
Adds a lot of space overhead, especially for structs only used once, yet it's not self-describing either.
Doesn’t solve a lot of problems related to changes either.
Quite frankly, too many are sold on it because it came from Google and is supposed to be some sort of divinely inspired thing.
JSON, ASN.1, and even rigid C structs start to look a lot better.
He also removed the capability to define a structure, forcing you to use a dictionary (structure) of arrays instead of an array of structures.
It's a lesson most people learn the hard way after using PBs for a few months.
It's like how in go most structs don't have a constructor, they just use the 0 value.
Also oneof is made that way so that it is backwards compatible to add a new field and make it a oneof with an existing field. Not everything needs to be pure functional programming.
> This puts us in the uncomfortable position of needing to choose between one of three bad alternatives:
I don’t think there is a good system out there that works for both serialization and data models. I’d say it’s a mostly unsolved problem. I think I am happy with protobufs. I know that I have to fight against them contaminating the codebase—basically, your code that uses protobufs is code that directly communicates over raw RPC or directly serializes data to/from storage, and protobufs shouldn’t escape into higher-level code.
But, and this is a big but, you want that anyway. You probably WANT your serialization to be able to evolve independently of your application logic, and the easy way to do that is to use different types for each. You write application logic using types that have all sorts of validation (in the "parse, don't validate" sense) and your serialization layer uses looser validation. This looser validation is nice because you often end up with e.g. buggy code getting shipped that writes invalid data, and if you have a loose serialization layer that just preserves structure (like proto or json), you at least have a good way to munge it into the right shape.
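A minimal sketch of that layering, with a hypothetical `User` domain type and a loose dict standing in for the deserialized wire message (everything optional on the wire, invariants enforced at the boundary):

```python
from dataclasses import dataclass

# Domain type: validates on construction ("parse, don't validate").
# Past this boundary, application code never sees an invalid User.
@dataclass(frozen=True)
class User:
    id: int
    email: str

    def __post_init__(self):
        if self.id <= 0 or "@" not in self.email:
            raise ValueError("invalid User")

def from_wire(msg: dict) -> User:
    # Wire layer is permissive, like a deserialized proto or JSON object:
    # every field is optional, defaults fill the gaps, validation happens here.
    return User(id=msg.get("id", 0), email=msg.get("email", ""))

u = from_wire({"id": 7, "email": "a@b.c"})   # valid from this point on
```

The wire schema can now loosen, add, or rename fields without the strict domain type leaking those concerns into higher-level code.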
Evolving serialized types has been such a massive pain at a lot of workplaces and the ad-hoc systems I've seen often get pulled into adopting some of the same design choices as protos, like "optional fields everywhere" and "unknown fields are ok". Partly it may be because a lot of ex-Google employees are inevitably hanging around on your team, but partly because some of those design tradeoffs (not ALL of them, just some of them) are really useful long-term, and if you stick around, you may come to the same conclusion.
In the end I mostly want something that's a little more efficient and a little more typed than JSON, and protos fit the bill. I can put my full efforts into safety and the "correct" representation at a different layer, and yes, people will fuck it up and contaminate the code base with protos, but I can fix that or live with it.
* https://news.ycombinator.com/item?id=18188519
* https://hn.algolia.com/?q=%22Protobuffers+Are+Wrong%22
I guess I'll, once again, copy/paste the comment I made when this was first posted: https://news.ycombinator.com/item?id=18190005
--------
Hello. I didn't invent Protocol Buffers, but I did write version 2 and was responsible for open sourcing it. I believe I am the author of the "manifesto" entitled "required considered harmful" mentioned in the footnote. Note that I mostly haven't touched Protobufs since I left Google in early 2013, but I have created Cap'n Proto since then, which I imagine this guy would criticize in similar ways.
This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering. Type theory is a lot of fun to think about, but being simple and elegant from a type theory perspective does not necessarily translate to real value in real systems. Protobuf has undoubtedly, empirically proven its real value in real systems, despite its admittedly large number of warts.
The main thing that the author of this article does not seem to understand -- and, indeed, many PL theorists seem to miss -- is that the main challenge in real-world software engineering is not writing code but changing code once it is written and deployed. In general, type systems can be both helpful and harmful when it comes to changing code -- type systems are invaluable for detecting problems introduced by a change, but an overly-rigid type system can be a hindrance if it means common types of changes are difficult to make.
This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.
I don't have time to address all the author's points, so let me choose a few that I think are representative of the misunderstanding.
> Make all fields in a message required. This makes messages product types.
> Promote oneof fields to instead be standalone data types. These are coproduct types.
This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility. This has proven -- in real practice, not in theory -- to be an extremely powerful way to allow protocol evolution. It allows developers to build new features with minimal work.
Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.
The author dismisses this later on:
> What protobuffers are is permissive. They manage to not shit the bed when receiving messages from the past or from the future because they make absolutely no promises about what your data will look like. Everything is optional! But if you need it anyway, protobuffers will happily cook up and serve you something that typechecks, regardless of whether or not it's meaningful.
In real world practice, the permissiveness of Protocol Buffers has proven to be a powerful way to allow for protocols to change over time.
Maybe there's an amazing type system idea out there that would be even better, but I don't know what it is. Certainly the usual proposals I see seem like steps backwards. I'd love to be proven wrong, but not on the basis of perceived elegance and simplicity, but rather in real-world use.
> oneof fields can't be repeated.
(background: A "oneof" is essentially a tagged union -- a "sum type" for type theorists. A "repeated field" is an array.)
Two things:
1. It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in. Lots of protocols used this pattern before I added "oneof" to the language, and I wanted those protocols to be able to upgrade to the new construct without breaking compatibility.
You might argue that this is a side-effect of a system evolving over time rather than being designed, and you'd be right. However, there is no such thing as a successful system which was designed perfectly upfront. All successful systems become successful by evolving, and thus you will always see this kind of wart in anything that works well. You should want a system that thinks about its existing users when creating new features, because once you adopt it, you'll be an existing user.
2. You actually do not want a oneof field to be repeated!
Here's the problem: Say you have your repeated "oneof" representing an array of values where each value can be one of 10 different types. For a concrete example, let's say you're writing a parser and they represent tokens (number, identifier, string, operator, etc.).
Now, at some point later on, you realize there's some additional piece of data you want to attach to every element. In our example, it could be that you now want to record the original source location (line and column number) where the token appeared.
How do you make this change without breaking compatibility? Now you wish that you had defined your array as an array of messages, each containing a oneof, so that you could add a new field to that message. But because you didn't, you're probably stuck creating a parallel array to store your new field. That sucks.
In every single case where you might want a repeated oneof, you always want to wrap it in a message (product type), and then repeat that. That's exactly what you can do with the existing design.
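Written down as a schema, the advice looks like this – a hypothetical `.proto` sketch of the parser example above, where the repeated thing is a wrapper message containing the oneof, so the location fields can be added later without breaking anything:

```proto
// Hypothetical token schema: repeat a wrapper message, not the oneof itself.
message Token {
  oneof kind {
    double number     = 1;
    string identifier = 2;
    string str        = 3;
    string operator   = 4;
  }
  // Added later, without breaking old readers or writers:
  int32 line   = 5;
  int32 column = 6;
}

message TokenStream {
  repeated Token tokens = 1;  // the repeated message, each carrying the oneof
}
```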
The author's complaints about several other features have similar stories.
> One possible argument here is that protobuffers will hold onto any information present in a message that they don't understand. In principle this means that it's nondestructive to route a message through an intermediary that doesn't understand this version of its schema. Surely that's a win, isn't it?
> Granted, on paper it's a cool feature. But I've never once seen an application that will actually preserve that property.
OK, well, I've worked on lots of systems -- across three different companies -- where this feature is essential.
what alternative do we have? sending json and base64 strings
the fact that protobuffers wasn’t immediately relegated to the dustbin shows just how low the bar is for serialization formats.
I get the api interoperability between various languages when one wants to build a client with strict schema but in reality, this is more of a theory than real life.
In essence, anyone who subscribes to YAGNI understands that PB and gRPC are a big no-no.
PS: if you need binary format, just use cbor or msgpack. Otherwise the beauty of json is that it human-readable and easily parseable, so even if you lack access to the original schema, you can still EASILY process the data and UNDERSTAND it as well.
If you need to exchange data with other systems that you don't control, a simple format like JSON is vastly superior. You are restricted to handing over tree-like structures. That is a good thing as your consumers will have no problems reading tree-like structures.
It also makes it very simple for each consumer/producer to coerce this data into structs or objects as they please and that make sense to their usage of the data.
You have to validate the data anyhow (you do validate data received from the outside world, don't you?), so throwing in coercion is honestly the smallest of your problems.
You only need to touch your data coercion if someone decides to send you data in a different shape. For tree-like structures it is simple to add new things and stay backwards compatible.
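A small sketch of that versioning practice, with hypothetical field names: consumers that ignore unknown keys and default missing ones keep working across schema changes:

```python
import json

# Old consumer: reads only the fields it knows, ignores everything else.
def old_reader(raw: str) -> dict:
    d = json.loads(raw)
    return {"name": d.get("name"), "email": d.get("email")}

v1 = json.dumps({"name": "Ada"})                        # old writer
v2 = json.dumps({"name": "Ada",
                 "email": "ada@example.com",
                 "plan": "pro"})                        # newer writer added fields

print(old_reader(v1))   # {'name': 'Ada', 'email': None} – missing field defaults
print(old_reader(v2))   # unknown "plan" key is silently ignored
```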
Adding a spec on top of your data shapes that can potentially help consumers generate client code is a cherry on top of it and an orthogonal concern.
Making as few assumptions as possible about how your consumers deal with your data is a Good Thing(tm) that enabled such useful (still?) things as the WWW.