In F#, the most popular methods for writing parsers are FParsec and fslex/fsyacc. Although parser generators are very useful, I'm always a bit reluctant to depend on third-party software.
In the past I have worked on a development tool for safety-critical systems, which was itself subject to some of the limitations of software for safety-critical systems. In particular, the use of lex and yacc was not accepted.
I fell back on a technique suitable for hand-written parsers, namely recursive descent parsers. I was positively surprised by the inherent simplicity of the method and the clarity of code that resulted.
I was first exposed to the idea of using active patterns for parsing when I read a blog post by Jon Harrop.
Active patterns is a feature specific to F#, and the F# wikibook has a chapter dedicated to them. I'll be using parameterized partial active patterns.
We start by defining a type for the input. The integer denotes the number of lines read so far, the list of strings is the input split at line-breaks.
open System.Text.RegularExpressions type Stream = Stream of int * string list
A pattern to detect the end of the input, when the list of strings is empty.
let (|EOF|_|) = function
| Stream(_, []) -> Some()
| _ -> None
A pattern to detect end-of-lines, by recognizing list of strings which start with the empty string.
let (|EOL|_|) = function
| Stream(n, "" :: rest) ->
Some (Stream(n + 1, rest))
| _ -> None
This pattern eats white space.
let (|OneWS|_|) = function
| Stream(n, line :: rest) ->
let trimmed = line.TrimStart()
if trimmed <> line then
Some (Stream(n, trimmed :: rest))
else
None
| _ -> None
A convenient pattern to eat sequences of white space and newlines. Note the use of the rec keyword, which allows to refer to the pattern itself in its implementation.
let rec (|WS|_|) = function
| OneWS (WS s)
| EOL (WS s) -> Some s
| s -> Some s
We could also have attempted another variant, where WS uses itself for the first part, which would implement a left-recursive grammar. Unfortunately, this pattern would be useless, as all it would do is raise stack-overflow exceptions.
let rec (|WS1|_|) = function
| WS1 (OneWS s) -> Some s
| s -> Some s
My variant of the Regex pattern differs a bit from the one in the wikibook, to avoid re-building Regex objects, which has a non-neglectable cost.
let (|Regex|_|) (regex : Regex) = function
| Stream(n, line :: rest) ->
let m = regex.Match(line)
if m.Success then
let lineRest = line.Substring(m.Length)
let values =
[ for gr in m.Groups -> gr.Value ]
Some (values, Stream(n, lineRest :: rest))
else None
| _ -> None
A sample pattern to illustrate the use of the Regex pattern.
let personRegex = Regex("NAME: (\w+) AGE: (\d+)")
let (|Person|_|) = function
| WS (Regex personRegex ([_ ; name ; age], s)) -> Some ((name, age), s)
| _ -> None
A pattern to parse data of a large number of persons. I could have used recursion at the level of the pattern itself, similarly to WS. However, such a pattern would not be tail-recursive, and would cause stack overflows for large numbers of entries.
let (|Persons|_|) s =
let rec work s persons =
match s with
| Person (p, s) -> work s (p :: persons)
| _ -> (persons |> List.rev, s)
let persons, s = work s []
Some (persons, s)
Finally, some utility code to initialize streams, generate one million lines to parse, followed by the parsing itself.
let streamOfLines lines =
Stream(0, lines |> List.ofSeq)
let streamOfFile path =
Stream(0, System.IO.File.ReadAllLines(path) |> List.ofArray)
let stream =
seq { for n in 0..1000000 do
yield sprintf " NAME: Anonymous AGE: %d " (n % 99) }
|> streamOfLines
#time "on"
match stream with
| Persons (p, WS EOF) -> printfn "Success %A" p
| _ -> printfn "Failed"
#time "off"
Parsing takes about 9 seconds on my PC (18s if I recreate the Regex object repeatedly). I think that's very decent performance.
In a future post, I'll describe how to parse Boolean formulas and expand a bit on the power and performance of this method.