by smj-edison
1 subcomment
- Huh, that's a really interesting approach. I just wrote my first Pratt parser a month ago, and one of the most annoying things was debugging infinite loops in various places (I had both tokenizer bugs where no characters were consumed and parser bugs where a token was emitted but the position never advanced). It's doubly annoying in Zig, because the default test runner won't print stdout at all, and won't print stderr unless the program terminates on its own (Ctrl+C prints nothing). I resorted to building the test and running it manually, or jumping into a debugger to figure out recursion issues. It's working now, but if (really, when) I run into issues in the future I'll definitely add some helper functions to check emitting invariants.
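A rough sketch of that kind of helper, in TypeScript-style pseudocode rather than Zig (`withProgress` and `p.pos()` are made-up names for illustration, not from the article):

// Run one loop step and assert the parser actually consumed something,
// so a non-advancing iteration fails loudly instead of spinning forever.
function withProgress<T>(p: Parser, step: () => T): T {
    const before = p.pos(); // assumed accessor for the current token index
    const result = step();
    assert(p.pos() > before, "loop body consumed no tokens");
    return result;
}

// e.g. inside a repetition:
// while (!p.eof() && !p.at(")")) {
//     result.push(withProgress(p, () => expression(p)));
// }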
by sureglymop
1 subcomment
- In Rust I really like the grmtools set of tools: https://github.com/softdevteam/grmtools.
It is lex/yacc-style lexer and parser generation; it produces an LR(1) parser but uses the CPCT+ algorithm for error recovery. IIRC the way it works is that when an error occurs, the nearest likely valid token is inserted, the error is recorded, and parsing continues.
I would use this for anything simple enough, and recursive descent for anything more complicated or where even more context is needed for errors.
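For a sense of the general technique, here is a drastically simplified sketch of recovery-by-insertion in TypeScript-style pseudocode (this is not grmtools' actual CPCT+ algorithm, and the parser methods are hypothetical):

// If the expected token is missing, record an error and behave as if the
// token had been inserted, so parsing can continue and report later errors.
function expectOrInsert(p: Parser, kind: string): void {
    if (p.at(kind)) {
        p.advance();
    } else {
        p.errors.push(`expected '${kind}' at ${p.pos()}`);
        // do not advance: pretend `kind` was present and carry on
    }
}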
- I’m curious why the author chose to model this as an assertion stack. The developer must still remember to consume the assertion within the loop. Could the original example not be rewritten more simply as:
const result: ast.Expression[] = [];
p.expect("(");
while (!p.eof() && !p.at(")")) {
    const subexpr = expression(p);
    assert(subexpr !== undefined); // << here
    result.push(subexpr);
    if (!p.at(")")) p.expect(",");
}
p.expect(")");
return result;
- How about another approach: memoization. At each position in the source code, we never attempt to parse the same production more than once. This solves the infinite looping discussed by the author, because the memoization downgrades the "loop" to execute at most once. Of course I wouldn't literally use a while loop in code to represent the production. I would use a higher-level abstraction to indicate one-or-more or zero-or-more in the production; indeed, I would represent productions as data, not code.
This also has the benefit of work sharing. A production like `A B | C B` ensures that when parsing A or C consumes the same number of characters, the work to parse B is shared, despite not literally factoring the production into `(A | C) B`.
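A rough sketch of that memo table (packrat-style, keyed on production name and start position; all names here are illustrative, not from the article):

// Each production is attempted at most once per position; repeated attempts
// return the cached result (including cached failures) instead of re-running.
type MemoEntry = { node: unknown; end: number } | null; // null = failed here

const memo = new Map<string, MemoEntry>();

function applyRule(name: string, pos: number,
                   rule: (pos: number) => MemoEntry): MemoEntry {
    const key = `${name}@${pos}`;
    const cached = memo.get(key);
    if (cached !== undefined) return cached;
    const entry = rule(pos);
    memo.set(key, entry);
    return entry;
}

With such a table, `A B | C B` shares the attempt at B whenever A and C end at the same position, exactly as described above.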
- Writing parsers by hand this way can be fun (and might be required for the very highest-performance ones), but for robustness and ease of development you are generally better off using a parser combinator library.
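For reference, the combinator style in a toy TypeScript sketch (not any particular library's API):

// A parser is a function from (input, position) to a value plus the next
// position, or null on failure.
type P<T> = (input: string, pos: number) => { value: T; pos: number } | null;

const literal = (s: string): P<string> => (input, pos) =>
    input.startsWith(s, pos) ? { value: s, pos: pos + s.length } : null;

const seq = <A, B>(a: P<A>, b: P<B>): P<[A, B]> => (input, pos) => {
    const ra = a(input, pos);
    if (!ra) return null;
    const rb = b(input, ra.pos);
    return rb ? { value: [ra.value, rb.value], pos: rb.pos } : null;
};

const alt = <A>(a: P<A>, b: P<A>): P<A> => (input, pos) =>
    a(input, pos) ?? b(input, pos);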
- So ... someone calls their parsing strategy "resilient LL parsing" without actually implementing LL parsing, a technique known since the 1970s, and then has an infinite recursion bug? Probably skipped Parsing 101.