Tuesday, June 26, 2012

Recursive descent parsers using active patterns, part 2

In the previous post, I presented a method to implement a parser relying solely on core F# features, namely active patterns.
In this post, I'll touch on some of advantages and limits of this approach.
Recursive descent parsers are limited to LL(k) grammars, which means that left-recursive grammars cannot be directly handled. It is necessary to rewrite such grammars to remove left recursion.

Consider the common problem of parsing expressions, e.g. Boolean expressions. A typical formulation of the  grammar could be:

Expr := AndExpr | OrExpr | NegExpr | Variable | True | False | "(" Expr ")"
AndExpr := Expr "AND" Expr
OrExpr := Expr "OR" Expr
NegExpr := "~" Expr | Expr

Note there is an ambiguity, as "A AND B OR C" could be parsed as (A AND B) OR C or as A AND (B OR C). In order to lift this ambiguity, priorities can be assigned to rules. Similarly for associativity, allowing to parse "A AND B AND C" as (A AND B) AND C or A AND (B AND C).

Such a grammar, if implemented naively using a recursive descent parser, would loop endlessly on some inputs: The parser would try to match rule AndExpr, which would try to match Expr, which would again try to match AndExpr...

An alternative formulation of the grammar can be used that will generate semantically equivalent results

Expr := OrExpr
OrExpr := AndExpr "OR" OrExpr | AndExpr
AndExpr := NegExpr "AND" AndExpr | NegExpr
NegExpr := "~" AtomExpr | AtomExpr
AtomExpr := Variable | True | False | "(" Expr ")"

Note the hierarchy between Expr, OrExpr, AndExpr, NegExpr and AtomExpr. The left-most rule appearing on an alternative in the right-hand side of each definition is always a "lower" rule: Expr refers to OrExpr, OrExpr to AndExpr, AndExpr to NegExpr, NegExpr to AtomExpr.
AtomExpr refers to the higher rule Expr in "(" Expr ")", but that's OK as Expr does not appear as the left-most rule (the token for a left parenthesis does).

What I like with this formulation is that it captures both associativity and operator precedence within the system of rules.

It is common to split parsers into two stages, using a lexer that processes the input stream in linear time, producing a pre-digested stream of so-called tokens. While this has undeniable advantages when it comes to performance, it prevents composing grammars and rule-specific tokenization. This is why most languages have keywords reserved for future use, and this also explains why you can't use identifiers that are also keywords even in situations where no ambiguity arises.

For these reasons, I personally prefer avoiding the separate lexing stage if performance constraints allow for it.

Regarding performance, it's important to write a recursive descent parser in such a way that decisions on the rule to try can always be done looking at a finite number of tokens.

In the example below, the second implementation performs significantly better:

let (|OrExpr|_|) s =
  match s with
  | AndExpr (e1, Token ("OR", OrExpr(es, s))) -> Some (e1 :: es, s)
  | AndExpr (e, s) -> Some ([e], s)
  | _ -> None
let (|OrExpr|_|) s =
  match s with
  | AndExpr (e1, s) ->
    let rec work s es =
      match s with
      | Token ("OR", AndExpr(e, s)) -> work s (e :: es)
      | _ -> (List.rev es, s)
    let es, s = work s []
    Some(e1 :: es, s)
  | _ -> None

To see what happens, consider the input "~X". This is a "degenerate" OrExpr with a single term. The first implementation will go all the way down through each rule to identify a NegExpr, promoting it to an AndExpr, then will look for "OR", which isn't there. The first match case will therefore fail, and the second will be tried, redoing the work. The same problem exists at all levels in AndExpr, NegExpr and AtomExpr. What makes matters worse is that this kind of "degenerate" case is actually the normal common case in typical expressions.

The second implementation uses the functional equivalent of a while loop, going as far as possible. The trick here is to factorize common prefixes into a single branch, thus avoiding redoing the work in case of a match failure.

An important problem with my simple example code is the lack of error messages. Those are important from a usability point of view, but I'm currently not too sure how to do this with manually-written parsers.

To summarize, using this technique requires some amount of manual work, which is the price to pay for not using smarter parsing frameworks. I wouldn't advise it over using parser generators, but it's interesting nonetheless. I can see it having uses for extremely small grammars, for instance when parsing fragments in a document. Otherwise, go for a parser generator such as fslex+fsyacc or ANTLR (which generates recursive descent parsers), or combinator-based parsers such as FParsec.