Hacker News

Parse, Don't Validate (2019)

248 points by shirian

by seanwilson

16 subcomments

Maybe I'm missing something and I'm glad this idea resonates, but it feels like sometime after Java got popular and dynamic languages got a lot of mindshare, a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.
In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around. You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead to avoid writing defensive code everywhere e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date (Edit: Changed this from email because email validation is a can of worms as an example). So there, "parse, don't validate" is the norm and not a tip/idea that would need to gain traction.
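A minimal Python sketch of the Date example above, assuming ISO 8601 input. The names `parse_deadline` and `days_left` are hypothetical, chosen only for illustration. The raw string is parsed once at the edge, and everything after that only ever sees a `date`.

```python
from datetime import date


def parse_deadline(raw: str) -> date:
    """Parse raw input once; raises ValueError if it is not a valid ISO date."""
    return date.fromisoformat(raw.strip())


def days_left(deadline: date, today: date) -> int:
    """Takes typed dates, so it needs no defensive string checks."""
    return (deadline - today).days


deadline = parse_deadline("2019-11-05")
print(days_left(deadline, date(2019, 11, 1)))  # 4
```

An impossible date such as "2019-02-30" fails inside `parse_deadline`, so `days_left` never has to consider it.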

by macintux

0 subcomment

A frequent visitor to HN. Tip: if you click on the "past" link under the title (but not the "past" link at the top of the page), you'll trigger a search for previous posts.
https://hn.algolia.com/?query=Parse%2C%20Don%27t%20Validate&...
However, it's more effective to throw quotes into the mix, which reduces false positives.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

by zdw

3 subcomments

This is a great article, but people often trip over the title and draw unusual conclusions.
The point of the article is about locality of validation logic in a system. Parsing in this context can be thought of as consolidating the logic that makes all structure and validity determinations about incoming data into one place in the program.
This lets you then rely on the fact that you have valid data in a known structure in all other parts of the program, which don't have to be crufted up with validation logic when used.
Related, it's worth looking at tools that further improve structure/validity locality like protovalidate for protobuf, or Schematron for XML, which allow you to outsource the entire validity checking to library code for existing serialization formats.
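The locality argument above could be sketched in Python roughly as follows. This is only an illustration: the `Order` record, its fields, and both function names are hypothetical. All structure and validity decisions live in `parse_order`, and `total_cents` relies on them without repeating any checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical record; holding one means the checks already passed."""
    sku: str
    quantity: int
    unit_price_cents: int


def parse_order(raw: dict) -> Order:
    """The single place that decides structure and validity for raw input."""
    sku = raw.get("sku")
    if not isinstance(sku, str) or not sku:
        raise ValueError("sku must be a non-empty string")
    # Missing keys raise KeyError; non-numeric values raise ValueError/TypeError.
    quantity = int(raw["quantity"])
    price = int(raw["unit_price_cents"])
    if quantity < 1 or price < 0:
        raise ValueError("quantity must be >= 1 and price must be >= 0")
    return Order(sku, quantity, price)


def total_cents(order: Order) -> int:
    """Downstream code: no validation logic needed here."""
    return order.quantity * order.unit_price_cents
```

Libraries such as the ones mentioned above move the body of `parse_order` into declarative schema rules; the shape of the program stays the same.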

by r4victor

3 subcomments

It seems modern statically-typed and even dynamically-typed languages all adopted this idea, except Go, where they decided zero values represent valid states always (or mostly).
A sincere question to Go programmers – what's your take on "Parse, Don't Validate"?

by dang

1 subcomments

Related. Others?
Parse, Don't Validate (2019) - https://news.ycombinator.com/item?id=41031585 - July 2024 (102 comments)
Parse, don't validate (2019) - https://news.ycombinator.com/item?id=35053118 - March 2023 (219 comments)
Parse, Don't Validate (2019) - https://news.ycombinator.com/item?id=27639890 - June 2021 (270 comments)
Parse, Don’t Validate - https://news.ycombinator.com/item?id=21476261 - Nov 2019 (230 comments)
Parse, Don't Validate - https://news.ycombinator.com/item?id=21471753 - Nov 2019 (4 comments)

by d0liver

1 subcomments

I think the more general rule is "push effects to the edges", which includes validation effects like reporting errors or crashing the program. If you, hypothetically, kept all of your runtime data in a big blob, but validated its structure right when you created it, then you could pass around that blob as an opaque representation. You could then later deserialize that blob and use it and everything would still be fine -- you'd just be carrying around the validation as a precondition rather than explicitly creating another representation for it. You could even use phantom types to carry around some of the semantics of your preconditions.
Point being: I think the rule is slightly more general, although this explanation is probably more intuitive.
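A rough Python sketch of the phantom-type idea, under the assumption that a static checker such as mypy is run over the code (Python itself does not enforce the type parameter at runtime). `Blob`, `Unchecked`, `Checked`, and the three functions are hypothetical names, and the "structure check" is a deliberately trivial stand-in.

```python
from typing import Generic, TypeVar


class Unchecked:
    """Marker type; never instantiated."""


class Checked:
    """Marker type; never instantiated."""


State = TypeVar("State")


class Blob(Generic[State]):
    """Opaque bytes; State exists only for the type checker (a phantom)."""

    def __init__(self, data: bytes) -> None:
        self.data = data


def receive(data: bytes) -> "Blob[Unchecked]":
    """Wrap raw input without looking at it."""
    return Blob(data)


def validate(blob: "Blob[Unchecked]") -> "Blob[Checked]":
    """Check the structure once; the representation itself does not change."""
    if not blob.data.startswith(b"{") or not blob.data.endswith(b"}"):
        raise ValueError("blob is not a braced document")
    return Blob(blob.data)


def store(blob: "Blob[Checked]") -> int:
    """Accepts only blobs whose precondition has been established."""
    return len(blob.data)
```

With a checker in place, `store(receive(b"..."))` should be flagged as a type error, while `store(validate(receive(b"...")))` passes; the bytes are never converted into a second representation.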

by pcwelder

0 subcomment

Each repost is worth it.
This, along with John Ousterhout's talk [1] on deep interfaces was transformational for me. And this is coming from a guy who codes in python, so lots of transferable learnings.
[1] https://www.youtube.com/watch?v=bmSAYlu0NcY

by throw567643u8

1 subcomments

What's lexi up to these days? Her last big contribution to Haskell was the delimited continuation primops, then she disappeared in a puff of smoke.

by kayo_20211030

0 subcomment

A great piece.
Unfortunately, it's somewhat of a religious argument about the one true way. I've worked on both sides of the fence, and each field is equally green in its own way. I've used OCaml, with static typing, and Clojure, with maybe-opt-in schema checking. They both work fine for real purposes.
The big problem arrives when you mix metaphors. With typing, you're either in, or you're out - or should be. You ought not to fall between stools. Each point of view works fine, approached in the right way, but don't pretend one thing is the other.

by rorylaitila

0 subcomment

I make great use of value objects in my applications but there are things I needed to do to make it ergonomic/performant. A "small" application of mine has over 100 value objects implemented as classes. Large apps easily get into the 1000s of classes just for value objects. That is a lot of boilerplate. It's a lot of boxing/unboxing. It'd be a lot more typing than in "stringly typed" programs.
To make it viable, all value objects are code-generated from model schemas, and then customized as needed (only like 5% need customization beyond basic data types). I have auto-upcasting on setters so you can code stringly when wanted, but everything is validated (very useful for writing unit tests more quickly). I only parse into types at boundaries or on writes/sets, not on reads/gets (limits the amount of boxing, particularly on reading large amounts of data). Heavy use of reflection, and auto-wiring/dependency injection.
But with these conventions in place, I quite enjoy it. Easy to customize/narrow a type. One convention for all validation. External inputs are by default secure with nice error messages. One place where all value validation happens (./values classes folder).
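One way the "auto-upcasting on setters" convention might look in Python is a descriptor that parses on set and does nothing on get. This is a hand-written sketch rather than generated code, and `Quantity`, `Upcast`, and `OrderLine` are hypothetical names, not the commenter's actual classes.

```python
class Quantity:
    """Hypothetical value object: a non-negative whole number."""

    def __init__(self, raw) -> None:
        value = int(raw)  # raises ValueError for text such as "many"
        if value < 0:
            raise ValueError("quantity must not be negative")
        self.value = value


class Upcast:
    """Descriptor that parses on set and does no extra work on get."""

    def __init__(self, value_type) -> None:
        self.value_type = value_type

    def __set_name__(self, owner, name) -> None:
        self.slot = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.slot)

    def __set__(self, obj, raw) -> None:
        if not isinstance(raw, self.value_type):
            raw = self.value_type(raw)  # upcast plain strings and ints
        setattr(obj, self.slot, raw)


class OrderLine:
    quantity = Upcast(Quantity)

    def __init__(self, quantity) -> None:
        self.quantity = quantity  # goes through Upcast.__set__


line = OrderLine("3")  # "stringly" input is accepted but validated
line.quantity = 5
print(line.quantity.value)  # 5
```

Reads return the stored `Quantity` directly, which matches the stated goal of paying the parsing cost only on writes.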

by 1-more

0 subcomment

A related talk is Richard Feldman's "Making Impossible States Impossible." Richard wrote a number of Elm packages and is the creator of the Roc language.
https://www.youtube.com/watch?v=IcgmSRJHu_8

by gaigalas

0 subcomment

> Now I have a single, snappy slogan that encapsulates what type-driven design means to me, and better yet, it’s only three words long
IMHO this is distracting and sort of vain. It forces this "semantics" perspective onto the reader, just so the author can have a snappy slogan.
Also, not all languages have such freedom in type expressiveness. Some of them do but offer terrible trade-offs.
The truth is, if you try to be that expressive in a language that doesn't support it you'll end up with a horror story. The article fails to mention that, and that "snappy slogan" makes it look like it's an absolute claim that you must internalize, some sort of deep truth that applies everywhere. It isn't.

by Joel_Mckay

0 subcomment

An unconstrained json/bson parser without recursive structure limits must be bounded somehow. In many cases, the ordering of marshaled data cannot be guaranteed across platforms.
The best method is to walk the symbolic tree with a cost function, and score the fitness of the data compared to expected structures. For example, mismatched or duplicate GUID/Account/permission/key fields reroute the message to the dead-letter queue for analysis, missing required fields trigger error messaging, and missing optional fields lower the qualitative score of the message content.
Parsers can be extremely unpredictable, and loosely typed formats are dangerous at times. =3
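A simplified Python sketch of that triage, using only the standard `json` module. The field names, the depth limit, the route labels, and the scoring rule are all assumptions made for illustration. Note that the depth check here runs after decoding, so it bounds later processing rather than the decoder itself.

```python
import json

REQUIRED = {"account", "key"}    # hypothetical required fields
OPTIONAL = {"comment", "tags"}   # hypothetical optional fields
MAX_DEPTH = 8                    # hypothetical nesting limit


class DuplicateField(ValueError):
    """Raised when one JSON object repeats a field name."""


def reject_duplicates(pairs):
    """object_pairs_hook: fail instead of silently keeping the last value."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise DuplicateField("duplicate field in object")
    return dict(pairs)


def check_depth(node, level=1):
    """Walk the decoded tree and refuse nesting deeper than MAX_DEPTH."""
    if level > MAX_DEPTH:
        raise ValueError("nesting too deep")
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        children = []
    for child in children:
        check_depth(child, level + 1)


def triage(text: str):
    """Return (route, score) for one raw message."""
    try:
        message = json.loads(text, object_pairs_hook=reject_duplicates)
        check_depth(message)
    except DuplicateField:
        return ("dead_letter", 0.0)   # ambiguous identity: hold for analysis
    except ValueError:
        return ("error", 0.0)         # malformed or too deeply nested
    if not isinstance(message, dict) or not REQUIRED.issubset(message):
        return ("error", 0.0)         # missing required fields
    present = len(OPTIONAL.intersection(message))
    return ("ok", present / len(OPTIONAL))  # missing optionals lower the score
```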

by hackrmn

0 subcomment

This article has done rounds on the Internet before. Maybe because it resonates with people (who repost it time and again). Anyway, I very much agree with the idea. In my experience, "text" or "string" is not a type. Technically it is one, of course, but I seldom see good use of it for when a more apt type would do better -- in short, it's a last resort thing, and it fares badly there too. Ironically, the only good use for it is as input to a... parser.
I see a lot of URLs being passed around as strings within a system perfectly capable of leveraging typing theory and offering user defined types, at least through the OOP goodness that a lot of people would furiously defend. The URL, in this case, would often have _already_ been parsed once, but is effectively "unparsed" and keeps being sent around as text in need of parsing at every "junction" of the system that needs to access it meaningfully, except that parsing is approached like some ungodly litany best avoided and thus foregone or lazily implemented with a regex where a regex isn't nearly sufficient. Perhaps it's because we lack parsers, by and large, or at the very least parser generators that are readily available, understandable (to your average developer), and simple enough to use without requiring an understanding of formal language theory with Chomsky hierarchy, context sensitivity, grammar ambiguity and parse forests, to say the least.
Same with [file] paths, HTTP header values, and other things that seem alluring to dismiss as only being text.
It wouldn't be a problem, had I not seen time and again how the "text" breaks -- URLs with malformed query parameters because why not just do `+ '?' + entries.map(([ name, value ]) => name + "=" + value).join("&")`, how hard can it be? Paths that assume a leading slash or lack thereof, etc.
I believe the article was born precisely of the same class of frustrations. So I am now bringing the same mantra everywhere with me: "There is no such type as string". Parse at earliest opportunity, lazily if the language allows it (most languages do) -- breadth first so as to not pay upfront, just don't let the text slip through.
I am talking from experience, really, your mileage may vary.
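To make the query-string failure concrete, here is a small Python sketch comparing plain concatenation (the same shape as the JavaScript one-liner above) with `urllib.parse.urlencode` from the standard library. The helper names and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def naive_url(base, entries):
    """String concatenation with no escaping of names or values."""
    return base + "?" + "&".join(name + "=" + value for name, value in entries)


def encoded_url(base, entries):
    """Let the library escape reserved characters in names and values."""
    return base + "?" + urlencode(entries)


def query_of(url):
    """Parse the query string back into a list of (name, value) pairs."""
    return parse_qsl(urlsplit(url).query)


entries = [("q", "a&b"), ("lang", "en")]

# The unescaped "&" inside the value is read as a separator, so "a&b" is lost.
print(query_of(naive_url("https://example.com/search", entries)))

# The encoded URL round-trips to the original pairs.
print(query_of(encoded_url("https://example.com/search", entries)))
```

Only the second URL gives back the pairs that went in, which is the kind of silent breakage described above.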

by tlavoie

0 subcomment

Along with all the general discussion, I found the concept of defensive parsing striking a chord when reading this as well: "The Seven Turrets of Babel: A Taxonomy of LangSec Errors and How to Expunge Them", https://langsec.org/papers/langsec-cwes-secdev2016.pdf
I'd love for these ideas to take hold at work, but I'm on the fringes in infosec, not a dev.

by benhoyt

2 subcomments

I'm not very familiar with functional programming and Haskell in particular. I think I understand the gist of this article, and "use data structures that make illegal states unrepresentable". However, is there a similar article but written with more common languages (C#, C++, Java, Go) in mind? Or is a big part of this concept only relevant for strong functional languages with sum types and pattern matching?
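As one data point for the question above, the non-empty list idea can at least be approximated in a mainstream language. The Python sketch below is an assumption-laden illustration, not a translation of the article's code: `NonEmpty`, `parse_non_empty`, and `first` are hypothetical names, and the guarantees are only as strong as the type checker (for example mypy) that is run over the code.

```python
from dataclasses import dataclass
from typing import Generic, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class NonEmpty(Generic[T]):
    """A sequence with at least one element by construction."""
    head: T
    tail: Tuple[T, ...] = ()


def parse_non_empty(items: Sequence[T]) -> Optional[NonEmpty[T]]:
    """The one emptiness check; success is recorded in the return type."""
    if not items:
        return None
    return NonEmpty(items[0], tuple(items[1:]))


def first(items: NonEmpty[T]) -> T:
    """There is no empty case left to handle here."""
    return items.head
```

Without sum types and exhaustive pattern matching, the caller's handling of the `None` case is checked less rigorously, but the core move of recording a check in a type still carries over.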

by sevensor

1 subcomments

Making illegal states unrepresentable sounds like a great idea, and it is, but I see it getting applied without nuance. “Has multiple errors” can be a valid type. Instead of bailing immediately, you can collect all of the errors so that they can be reported all together rather than forcing the user to fix one error at a time.
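A minimal Python sketch of that point, where the parser returns either a parsed value or a value that carries every problem found. The `Signup` form, its fields, and the error messages are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Signup:
    name: str
    age: int


@dataclass(frozen=True)
class Invalid:
    """'Has multiple errors' as an ordinary, representable value."""
    errors: Tuple[str, ...]


def parse_signup(raw: dict) -> Union[Signup, Invalid]:
    """Check every field and collect all errors instead of stopping early."""
    errors = []
    name = str(raw.get("name", "")).strip()
    if not name:
        errors.append("name is required")
    age_text = str(raw.get("age", "")).strip()
    if not age_text.isdecimal():
        errors.append("age must be a whole number")
    if errors:
        return Invalid(tuple(errors))
    return Signup(name, int(age_text))
```

The caller still gets a typed `Signup` on success, and on failure the user sees both messages in one pass rather than fixing one field at a time.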

by exodys

2 subcomments

Maybe I am being contrarian, or maybe I don't understand; if I am reading input, I am always going to validate that input after parsing. Especially if it is from a user.
I understand that they should be separate, but they should be very close together.

by mmis1000

0 subcomment

This article always ends up relevant once in a while.
Recently, I have been trying to get an LLM to output a specific format.
It turns out that no matter how you write the prompt and validate the result, it will never be as effective as just constraining the output with a proper BNF grammar (via a llama.cpp grammar file).

by yakshaving_jgt

0 subcomment

I did a lightning talk on this topic last year, with a concrete example in Yesod.
https://www.youtube.com/watch?v=MkPtfPwu3DM

by curiousgal

4 subcomments

Semi-tangent, but I am curious: for those with more experience in Python, do you just pass around generic pandas DataFrames, or do you parse each row into an object and write logic that manipulates those instead?

by cbondurant

0 subcomment

A really mindset-altering read for me. I've carried this way of thinking ever since I first read it a few years ago.

by LordDragonfang

1 subcomments

I'll be honest, as someone not familiar with Haskell, one of my main takeaways from this article is going down a rabbit hole of finding out how weird Haskell is.
The casualness with which the author states things like "of course, it's obvious to us that `Int -> Void` is impossible" makes me feel like I'm being xkcd 2501'd.

by metalliqaz

0 subcomment

bonus points for the correct use of "cromulent"

by whalesalad

1 subcomments

The author's point here is great, but the post does (imho) a poor job illustrating it.
The tl;dr on this is: stop sprinkling guards and if statements all over your codebase. Convert (parse) the data into truthful objects/structs/containers at the perimeter. The goal is to do that work at the boundaries of your system, so that inside of your system you can stop worrying about it and trust the value objects you have.
I think my hangup here is on the use of terms parse vs validate. They are not the right terms to describe this.

by danieltanfh95

6 subcomments

Hot take: Static typing is often touted as the end all be all, and all you need to do is "parse, don't validate" at the edge of your program and everything is fine and dandy.
In practice, I find that staunch static typing proponents are often middle or junior engineers that want to work with an idealised version of programming in their heads. In reality what you are looking for is "openness" and "consistency", because no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.
This is also why in practice a lot of customer input ends up being passed as "strings" or has a raw copy + parsed copy, because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program.

by waffletower

2 subcomments

I'm sorry, I don't like to title drop, but I am a Staff Data Engineer and I find that "type driven" development is an inappropriate world view for many programming contexts that I encounter. I use "world view" carefully as it makes a contractual assumption about reality -- "give me what I expect". Data processing does not always have the luxury of such imposition. In these contexts a dynamic and introspective world view is more appropriate, "What do we have here?" "What can we use?". In 2019 I would have felt crippled by use of Haskell in data processing contexts and have instead done much in Clojure in these intervening years, though now LLM assisted use of Haskell toward such tasks would be a fun spectator sport.