Niko Matsakis: Nice errors in LALRPOP

Среда, 02 Марта 2016 г. 20:58 + в цитатник

For the last couple of weeks, my mornings have been occupied with a pretty serious revamping of LALRPOP’s error message output. I will probably wind up doing a series of blog posts about the internal details of how it works, but I wanted to write a little post to advertise this work.

Typically when you use an LR(1) parser generator, error messages tend to be written in terms of the LR(1) state generation algorithm. They use phrases like shift/reduce conflict and talk about LR(1) items. Ultimately, you have to do some clever thinking to relate the error to your grammar, and then a bit more clever thinking to figure out how you should adjust your grammar to make the problem go away. While working on adapting the Rust grammar to LALRPOP, I found I was wasting a lot of time trying to decrypt the error messages, and I wanted to do something about it. This work is the result.

An aside: It’s definitely worth citing Menhir as an inspiration, which is an awesome parser generator for OCaml. Menhir offers a lot of the same features that LALRPOP does, and in particular generates errors very similar to those I am talking about here.

What I’ve tried to do now in LALRPOP is to do that clever thinking for you, and instead present the error message in terms of your grammar. Perhaps even more importantly, I’ve also tried to identify common beginner problems and suggest solutions. Naturally this is a work-in-progress, but I’m already pretty excited with the current status, so I wanted to write up some examples of it in action.

Diagnosing ambiguous grammars

Let’s start with an example of a truly ambiguous grammar. Imagine that I have this grammar for a simple calculator (in LALRPOP syntax, which I hope will be mostly self explanatory):

use std::str::FromStr;
grammar;
pub Expr: i32 = {
    <n:r"[0-9]+"> => i32::from_str(n).unwrap(),
    <l:Expr> "+" <r:Expr> => l + r,
    <l:Expr> "-" <r:Expr> => l - r,
    <l:Expr> "*" <r:Expr> => l * r,
    <l:Expr> "/" <r:Expr> => l / r,
};

This grammar evaluates expressions like 1 + 2 * 3 and yields a 32-bit integer as the result. The problem is that this grammar is quite ambiguous: it does not encode the precedence of the various operators in any particular way. The older versions of LALRPOP gave you a rather opaque error concerning shift/reduce conflicts. As of version 0.10, though, you get this (the actual output even uses ANSI colors, if available):

calc.lalrpop:6:5: 6:34: Ambiguous grammar detected

  The following symbols can be reduced in two ways:
    Expr "*" Expr "*" Expr

  They could be reduced like so:
    Expr "*" Expr "*" Expr
    +-Expr------+        |
    +-Expr---------------+

  Alternatively, they could be reduced like so:
    Expr "*" Expr "*" Expr
    |        +-Expr------+
    +-Expr---------------+

  Hint: This looks like a precedence error related to `Expr`. See the LALRPOP
  manual for advice on encoding precedence.

Much clearer, I’d say! And note, if you look at the last sentence, that LALRPOP is even able to diagnose that this an ambiguity specifically about precedence and refer you to the manual – now, if only I’d written the LALRPOP manual, we’d be all set.

I should mention that LALRPOP also reports several other errors, all of which are related to the precedence. For example, it will also report:

/Users/nmatsakis/tmp/prec-calc.lalrpop:6:5: 6:34: Ambiguous grammar detected

  The following symbols can be reduced in two ways:
    Expr "*" Expr "+" Expr

  They could be reduced like so:
    Expr "*" Expr "+" Expr
    +-Expr------+        |
    +-Expr---------------+

  Alternatively, they could be reduced like so:
    Expr "*" Expr "+" Expr
    |        +-Expr------+
    +-Expr---------------+

  LALRPOP does not yet support ambiguous grammars. See the LALRPOP manual for
  advice on making your grammar unambiguous.

The code for detecting precedence errors however doesn’t consider errors between two distinct tokens (here, * and +), so you don’t get a specific message, just a general note about ambiguity. This seems like an area that would be nice to improve.

Diagnosing LR(1) limitations and suggesting inlining

That last example was a case where the grammar was fundamentally ambiguous. But sometimes there are problems that have to do with how LR(1) parsing works; diagnosing these nicely is even more important, because they are less intuitive to the end user. Also, LALRPOP has several tools that can help make dealing with these problems easier, so where possible we’d really like to suggest these tools to users.

Let’s start with a grammar for parsing Java import declarations. Java’s import declarations have this form:

1 2	`import java.util.*; import java.lang.String;`

A first attempt at writing a grammar for them might look like this (in this grammar, I gave all of the nonterminals the type (), so there is no need for action code; this means that this grammar does not build a parse tree, and so it can only be used to decide if the input is legal Java or not):

grammar;

pub ImportDecl: () = {
    "import" Path ";",
    "import" Path "." "*" ";",
};

Path: () = Ident ("." Ident)*;

Ident = r#"[a-zA-Z][a-zA-Z0-9]*"#;

Now, unlike before, this grammar is unambiguous. Nonetheless, if we try to run it through LALRPOP, we will get the following error:

java.lalrpop:8:12: 8:29: Local ambiguity detected

  The problem arises after having observed the following symbols in the input:
    "import" Ident
  At that point, if the next token is a `"."`, then the parser can proceed in
  two different ways.

  First, the parser could execute the production at java.lalrpop:8:12: 8:29,
  which would consume the top 1 token(s) from the stack and produce a `Path`.
  This might then yield a parse tree like
    "import" Ident

http://smallcultfollowing.com/babysteps/blog/2016/03/02/nice-errors-in-lalrpop/