Friday, May 18, 2012

Recursive descent parser using active patterns

Today's post isn't specifically about games, but about parsing, which I find is a recurring task in many programming tasks, including game-related tasks.

In F#, the most popular methods for writing parsers are FParsec and fslex/fsyacc. Although parser generators are very useful, I'm always a bit reluctant to depend on third-party software.

In the past I have worked on a development tool for safety-critical systems, which was itself subject to some of the limitations of software for safety-critical systems. In particular, the use of lex and yacc was not accepted.

I fell back on a technique suitable for hand-written parsers, namely recursive descent parsers. I was positively surprised by the inherent simplicity of the method and the clarity of code that resulted.

I was first exposed to the idea of using active patterns for parsing when I read a blog post by Jon Harrop.
Active patterns is a feature specific to F#, and the F# wikibook has a chapter dedicated to them. I'll be using parameterized partial active patterns.

We start by defining a type for the input. The integer denotes the number of lines read so far, the list of strings is the input split at line-breaks.
open System.Text.RegularExpressions

type Stream = Stream of int * string list

A pattern to detect the end of the input, when the list of strings is empty.
let (|EOF|_|) = function
    | Stream(_, []) -> Some()
    | _ -> None

A pattern to detect end-of-lines, by recognizing list of strings which start with the empty string.
let (|EOL|_|) = function
    | Stream(n, "" :: rest) ->
        Some (Stream(n + 1, rest))
    | _ -> None
This pattern eats white space.
let (|OneWS|_|) = function
    | Stream(n, line :: rest) ->
        let trimmed = line.TrimStart()
        if trimmed <> line then
            Some (Stream(n, trimmed :: rest))
        else
            None
    | _ -> None
A convenient pattern to eat sequences of white space and newlines. Note the use of the rec keyword, which allows to refer to the pattern itself in its implementation.
let rec (|WS|_|) = function
    | OneWS (WS s)
    | EOL (WS s) -> Some s
    | s -> Some s
We could also have attempted another variant, where WS uses itself for the first part, which would implement a left-recursive grammar. Unfortunately, this pattern would be useless, as all it would do is raise stack-overflow exceptions.
let rec (|WS1|_|) = function
    | WS1 (OneWS s) -> Some s
    | s -> Some s
My variant of the Regex pattern differs a bit from the one in the wikibook, to avoid re-building Regex objects, which has a non-neglectable cost.
let (|Regex|_|) (regex : Regex) = function
    | Stream(n, line :: rest) ->
        let m = regex.Match(line)
        if m.Success then
            let lineRest = line.Substring(m.Length)
            let values =
                [ for gr in m.Groups -> gr.Value ]
            Some (values, Stream(n, lineRest :: rest))
        else None
    | _ -> None
A sample pattern to illustrate the use of the Regex pattern.
let personRegex = Regex("NAME: (\w+) AGE: (\d+)")
let (|Person|_|) = function
    | WS (Regex personRegex ([_ ; name ; age], s)) -> Some ((name, age), s)
    | _ -> None
A pattern to parse data of a large number of persons. I could have used recursion at the level of the pattern itself, similarly to WS. However, such a pattern would not be tail-recursive, and would cause stack overflows for large numbers of entries.
let (|Persons|_|) s =
    let rec work s persons =
        match s with
        | Person (p, s) -> work s (p :: persons)
        | _ -> (persons |> List.rev, s)
    let persons, s = work s []
    Some (persons, s)
Finally, some utility code to initialize streams, generate one million lines to parse, followed by the parsing itself.
let streamOfLines lines =
    Stream(0, lines |> List.ofSeq)

let streamOfFile path =
    Stream(0, System.IO.File.ReadAllLines(path) |> List.ofArray)

let stream =
    seq { for n in 0..1000000 do
            yield sprintf "  NAME: Anonymous AGE: %d              " (n % 99) }
    |> streamOfLines

#time "on"
match stream with
| Persons (p, WS EOF) -> printfn "Success %A" p
| _ -> printfn "Failed"
#time "off"
Parsing takes about 9 seconds on my PC (18s if I recreate the Regex object repeatedly). I think that's very decent performance.
In a future post, I'll describe how to parse Boolean formulas and expand a bit on the power and performance of this method.