Hacker News

I rewrote PostHog's SQL parser, 70x faster, while barely looking at the code

141 points by robbie-c

by duendefm

4 subcomments

Well, despite my current anti-AI sentiment, I have to admit that after reading the article, it was a good use of AI, done by someone with good technical skills. Still, I have the feeling that this only works because of the vast knowledge accumulated pre-AI, and if everybody keeps going down this path, it will end up keeping everyone from advancing their knowledge at the pace they did before. I feel that this AI immersion is really about selling our soul to the devil for short-term gains.

by cespare

2 subcomments

> We didn't write this parser by hand because, at least pre-AI-coding, parsers were extremely difficult to maintain. Writing one without AI would have taken months [...]
> Instead, we use ANTLR, a state-of-the-art, open source parser generator.
I don't agree with this (pre-AI-coding) take. Hand-rolled parsers are much easier to write well and maintain than people think. They also tend to be much faster and produce much better errors than parser generators. I guess if the language you're trying to parse is, say, C++, then you're going to have a miserable time (probably no matter what). But an SQL parser is very doable. (I say this as the author and maintainer of an in-house SQL dialect thingy at work.)
What makes building and maintaining a hand-written parser such a tractable task is:
- The code size can be large, but you can start with a core of a few well-chosen abstractions and then you add lots of parsing code for various language constructs but it's all kind of orthogonal and doesn't add compounding complexity as you go.
- It's just about the most testable kind of code there is. You can cover all the various corner cases with tests and really lock in the behavior so that you can very confidently make changes. One approach I like is to make zillions of tiny test files in the target language accompanied by some golden representation of the AST.
And of course, as the author found out, these properties make writing a parser a really good task for AI coding, too. These tools are very, very good at generating a bunch of new code based on existing abstractions and covering it with lots of test cases.
So I agree with where they ended up, just not where they started :)
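
For readers who haven't seen the golden-file approach described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy grammar (`SELECT col, ... FROM table`), the names `parse_select`, `write_case` and `run_golden`, and the `.sql`/`.json` file layout. It is not PostHog's parser or test harness, only the general shape of the technique.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical toy grammar: SELECT col[, col...] FROM table
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|,|\*)")

def tokenize(sql):
    sql = sql.strip()
    pos, tokens = 0, []
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}: {sql[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_select(sql):
    """Hand-written recursive-descent parser for the toy grammar."""
    toks = tokenize(sql)
    i = 0

    def expect(word):
        nonlocal i
        if i >= len(toks) or toks[i].upper() != word:
            got = toks[i] if i < len(toks) else "end of input"
            raise SyntaxError(f"expected {word}, got {got}")
        i += 1

    def ident():
        nonlocal i
        if i >= len(toks) or toks[i] == ",":
            raise SyntaxError("expected identifier")
        i += 1
        return toks[i - 1]

    expect("SELECT")
    columns = [ident()]
    while i < len(toks) and toks[i] == ",":
        i += 1
        columns.append(ident())
    expect("FROM")
    table = ident()
    if i != len(toks):
        raise SyntaxError(f"trailing input: {toks[i]}")
    return {"type": "select", "columns": columns, "from": table}

def write_case(case_dir, name, sql, golden_ast):
    """Store one tiny test case: <name>.sql plus its golden <name>.json."""
    Path(case_dir, f"{name}.sql").write_text(sql)
    Path(case_dir, f"{name}.json").write_text(json.dumps(golden_ast, indent=2))

def run_golden(case_dir):
    """Return the names of the .sql files whose parsed AST differs from the golden."""
    failures = []
    for sql_path in sorted(Path(case_dir).glob("*.sql")):
        golden = json.loads(sql_path.with_suffix(".json").read_text())
        if parse_select(sql_path.read_text()) != golden:
            failures.append(sql_path.name)
    return failures

def demo():
    with tempfile.TemporaryDirectory() as case_dir:
        write_case(case_dir, "basic", "SELECT a, b FROM t",
                   {"type": "select", "columns": ["a", "b"], "from": "t"})
        write_case(case_dir, "star", "select * from events",
                   {"type": "select", "columns": ["*"], "from": "events"})
        return run_golden(case_dir)
```

In a real project the case directory would live in the repository and each new language construct would add a few more file pairs, which is what keeps the test suite growing linearly with the grammar.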

by joshmoody24

0 subcomment

The old parser has p95 of like 450ms, that seems weirdly slow to me even for a parser generator. Is my intuition wrong? Maybe it's parsing some truly enormous SQL queries?

by jakewins

0 subcomment

I’ve had very good success in similar setups where you have some sort of “oracle” and can generate enormous corpuses of test data, such that you really, really trust the LLM code must work for the inputs you expect it’ll ever need to handle.
Makes me think of all the algorithms we specify in proof languages and then hand-implement in production languages - this setup could maybe let you just specify the proof of an algorithm and then let LLMs derive efficient implementations with the (slow) proof as an oracle
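
A rough sketch of what such an oracle setup can look like, in Python. The problem (maximum subarray sum), the function names, and the input generator are all assumptions picked to keep the example short; the point is only the pattern of a slow, obviously correct reference checked against a fast candidate on many generated inputs.

```python
import random

def max_subarray_oracle(xs):
    """Slow reference: try every non-empty contiguous slice."""
    n = len(xs)
    return max(sum(xs[i:j]) for i in range(n) for j in range(i + 1, n + 1))

def max_subarray_fast(xs):
    """Fast candidate (Kadane's algorithm), the code we want to trust."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def differential_test(oracle, candidate, trials=500, seed=0):
    """Return the first input on which the two disagree, or None if none is found."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 12))]
        if oracle(xs) != candidate(xs):
            return xs
    return None
```

Calling `differential_test(max_subarray_oracle, max_subarray_fast)` returns `None` when no disagreement is found, and otherwise returns a concrete counterexample that can be handed back to whoever (or whatever) wrote the candidate.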

by theLiminator

0 subcomment

This is the type of problem that LLM generation is great for.
If you have an oracle, and your problem is largely just a pure function, it's pretty good at generating something that both works and is fast.

by mikkelam

6 subcomments

I cannot believe they're sticking to their guns on this website design. It's awful.

by jamestexas

0 subcomment

This is super cool, and I am totally going to glean from how you handled testing some of this.
I have a tool I make as a data-plane to a graph engine, and it uses Cap'n Proto to help (and SQLite as a sort of IPC option). One of the biggest things I have is, I know I am not testing all of it to completion. I am not even really fuzzing, yet.
Thanks for sharing!

by justAnotherHero

0 subcomment

That's great, but I really wish you guys would do something about the LLM integration. I tried using it two days ago to create a cohort of users from a SQL query, and I was surprised to see it say that it could not create cohorts for me; I had to resort to exporting data from a SQL insight, as a cohort cannot use a SQL query. The worst part, however, was that just typing in the text input slowed my M4 Pro down to less than 1 fps after 2 prompts, and it really left a bad taste in my mouth.
Perhaps the next target for a 100x improvement

by russellthehippo

1 subcomments

The key part of this is how not vibecoded it is. Feels like a model of how you should do software with AI. Now that we can easily set up property testing, fuzzing, etc. there's almost no reason not to.
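
As one illustration of how little setup a property test needs, here is a sketch of a round-trip property in Python using only the standard library. The toy `render`/`parse` pair, the keyword list, and the identifier generator are assumptions for the example, not anything taken from the article; a library such as Hypothesis would normally replace the hand-rolled generator.

```python
import random
import re
import string

KEYWORDS = {"select", "from"}  # identifiers the toy grammar cannot use

def render(ast):
    """Print a toy AST back to SQL text."""
    return f"SELECT {', '.join(ast['columns'])} FROM {ast['from']}"

def parse(sql):
    """Toy parser for: SELECT col[, col...] FROM table"""
    m = re.fullmatch(r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*", sql,
                     flags=re.IGNORECASE | re.DOTALL)
    if m is None:
        raise SyntaxError(f"cannot parse: {sql!r}")
    columns = [c.strip() for c in m.group(1).split(",")]
    return {"columns": columns, "from": m.group(2)}

def random_ident(rng):
    while True:
        length = rng.randint(1, 6)
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if name not in KEYWORDS:
            return name

def random_ast(rng):
    return {
        "columns": [random_ident(rng) for _ in range(rng.randint(1, 4))],
        "from": random_ident(rng),
    }

def check_round_trip(trials=1000, seed=0):
    """Property: parse(render(ast)) == ast. Returns a failing AST, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        ast = random_ast(rng)
        if parse(render(ast)) != ast:
            return ast
    return None
```

The property says nothing about which SQL strings are valid; it only pins down that printing and parsing agree with each other, which is often enough to catch regressions in either half.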

by keeda

0 subcomment

A while ago I had predicted that all coding would eventually become vibe-coding but it would still be a deep engineering discipline (https://news.ycombinator.com/item?id=48040206) -- this is what I meant. Deep technical expertise is still needed, but it shifts from working with the code directly to crafting bespoke comprehensive validation mechanisms around the code. This is a great example of what that could look like.
So it's technically vibe-coding in the sense you don't really look at the code, you just look at the results and "go by the vibes"... except now you're working to rigorously quantify and enforce those vibes. (Philosophical aside: once vibes are rigorously enforced are they "vibes" anymore?)

by ndr

1 subcomments

Great loop spotting!
Recently I was messing around with parquet files in Python and ended up needing to ship the results on Windows, without a Windows machine to test on.
Shipping Python to end users is half mad already, and doing it on Windows is exactly the kind of thing I don't want to spend my life maintaining.
So I figured I'd rewrite it in Go. But that meant embedding a DLL, and how would I test it? I could spin up a VM, sure. But GitHub Actions already has a Windows environment, and there was my loop: let the agent push to the repo, run tests in GHA, rinse and repeat.
In under an hour it had a full rewrite of my Python, passing every test and producing row-for-row copies of my Parquet output. And it does work on the user machine!
Spotting a loop like that is as satisfying as noticing you can walk your chess opponent into a smothered mate. Truly empowering.

by lovasoa

4 subcomments

The thing I would have liked to know is why they don't use an existing fast SQL parser. Was being slightly incompatible with all existing SQL dialects a product requirement?

by boiler_up800

0 subcomment

Very good and interesting article, particularly the “loop” that he ended up with.
Amusing anecdotes on LLMs too:
> It did, in fact, make a lot of mistakes, kept doubting whether such a rewrite was even possible, and wanted to call it a day after each round of coding.
> Hilariously one of the most effective was to tell Claude to “think really hard about edge cases” in a background agent.

by zingar

0 subcomment

This must be the most compelling look I’ve seen at how software might work with LLMs doing a ton of heavy lifting.
There’s something kind of amazing here in that having read about property based testing I’m pretty confident I could apply it if I had a good use case.

by sam_lowry_

0 subcomment

Dunno about the parser, but you broke scrolling on your fancy website without noticing it also ;-)

by spinachsalad

1 subcomments

> there’s a test for SELECT SELECT FROM FROM WHERE WHERE AND AND which is completely valid SQL
Is this even true? I tried it in SQLite and there's a syntax error after first SELECT. It would work when "SELECT", "FROM" etc. are quoted, but that's not the same thing.
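
The SQLite behaviour described here is easy to reproduce with Python's built-in `sqlite3` module. This sketch only shows what SQLite does; other SQL dialects may treat keywords differently. The table and column names are made up so that the quoted form has something to select from.

```python
import sqlite3

def try_query(conn, sql):
    """Run a query and return its rows, or an 'error: ...' string on a syntax error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"

conn = sqlite3.connect(":memory:")
# Quoted keywords are ordinary identifiers, so this table is legal in SQLite.
conn.execute('CREATE TABLE "FROM" ("SELECT" INTEGER, "WHERE" INTEGER, "AND" INTEGER)')
conn.execute('INSERT INTO "FROM" VALUES (1, 1, 1)')

# Unquoted: SQLite rejects the second SELECT as a syntax error.
unquoted = try_query(conn, "SELECT SELECT FROM FROM WHERE WHERE AND AND")

# Quoted: parses and runs, returning the single matching row.
quoted = try_query(conn, 'SELECT "SELECT" FROM "FROM" WHERE "WHERE" AND "AND"')
```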

by duke_of_vandals

2 subcomments

How long did this take?

by ncruces

0 subcomment

You have a grammar file in a formal language, and want to generate a faster parser in another formal language.
What's wrong with the source language that it's better to use a sufficiently smart random code generator for the target language, and then fuzz the hell out of the output of it until it behaves the same as the slow translated code, than to create a sufficiently smart compiler from the source to target languages?
I mean this sounds like if we replaced GCC with a really smart random assembly generator and a fuzzer for the output.

by sayrer

0 subcomment

ha, try to keep going. Run it under samply and Gungraun (need AMD64 for this)

by westurner

0 subcomment

Could the agent traces from this be used to improve sqlglot?
tobymao/sqlglot: Python SQL Parser and Transpiler; with tests and support for 30+ dialects: https://github.com/tobymao/sqlglot
Ibis depends upon sqlglot: https://github.com/tobymao/sqlglot/network/dependents

by CrzyLngPwd

0 subcomment

"I didn't rewrite"

by sscaryterry

1 subcomments

Good read, but "70x" is always misleading.