In F#, the most popular methods for writing parsers are FParsec and fslex/fsyacc. Although parser generators are very useful, I'm always a bit reluctant to depend on third-party software.
In the past I have worked on a development tool for safety-critical systems, which was itself subject to some of the limitations of software for safety-critical systems. In particular, the use of lex and yacc was not accepted.
I fell back on a technique suitable for hand-written parsers, namely recursive descent parsers. I was positively surprised by the inherent simplicity of the method and the clarity of code that resulted.
I was first exposed to the idea of using active patterns for parsing when I read a blog post by Jon Harrop.
Active patterns is a feature specific to F#, and the F# wikibook has a chapter dedicated to them. I'll be using parameterized partial active patterns.
We start by defining a type for the input. The integer denotes the number of lines read so far, the list of strings is the input split at line-breaks.
open System.Text.RegularExpressions type Stream = Stream of int * string list
A pattern to detect the end of the input, when the list of strings is empty.
let (|EOF|_|) = function | Stream(_, []) -> Some() | _ -> None
A pattern to detect end-of-lines, by recognizing list of strings which start with the empty string.
let (|EOL|_|) = function | Stream(n, "" :: rest) -> Some (Stream(n + 1, rest)) | _ -> NoneThis pattern eats white space.
let (|OneWS|_|) = function | Stream(n, line :: rest) -> let trimmed = line.TrimStart() if trimmed <> line then Some (Stream(n, trimmed :: rest)) else None | _ -> NoneA convenient pattern to eat sequences of white space and newlines. Note the use of the rec keyword, which allows to refer to the pattern itself in its implementation.
let rec (|WS|_|) = function | OneWS (WS s) | EOL (WS s) -> Some s | s -> Some sWe could also have attempted another variant, where WS uses itself for the first part, which would implement a left-recursive grammar. Unfortunately, this pattern would be useless, as all it would do is raise stack-overflow exceptions.
let rec (|WS1|_|) = function | WS1 (OneWS s) -> Some s | s -> Some sMy variant of the Regex pattern differs a bit from the one in the wikibook, to avoid re-building Regex objects, which has a non-neglectable cost.
let (|Regex|_|) (regex : Regex) = function | Stream(n, line :: rest) -> let m = regex.Match(line) if m.Success then let lineRest = line.Substring(m.Length) let values = [ for gr in m.Groups -> gr.Value ] Some (values, Stream(n, lineRest :: rest)) else None | _ -> NoneA sample pattern to illustrate the use of the Regex pattern.
let personRegex = Regex("NAME: (\w+) AGE: (\d+)") let (|Person|_|) = function | WS (Regex personRegex ([_ ; name ; age], s)) -> Some ((name, age), s) | _ -> NoneA pattern to parse data of a large number of persons. I could have used recursion at the level of the pattern itself, similarly to WS. However, such a pattern would not be tail-recursive, and would cause stack overflows for large numbers of entries.
let (|Persons|_|) s = let rec work s persons = match s with | Person (p, s) -> work s (p :: persons) | _ -> (persons |> List.rev, s) let persons, s = work s [] Some (persons, s)Finally, some utility code to initialize streams, generate one million lines to parse, followed by the parsing itself.
let streamOfLines lines = Stream(0, lines |> List.ofSeq) let streamOfFile path = Stream(0, System.IO.File.ReadAllLines(path) |> List.ofArray) let stream = seq { for n in 0..1000000 do yield sprintf " NAME: Anonymous AGE: %d " (n % 99) } |> streamOfLines #time "on" match stream with | Persons (p, WS EOF) -> printfn "Success %A" p | _ -> printfn "Failed" #time "off"Parsing takes about 9 seconds on my PC (18s if I recreate the Regex object repeatedly). I think that's very decent performance.
In a future post, I'll describe how to parse Boolean formulas and expand a bit on the power and performance of this method.