Scheme Parser Combinators

I’ve had my MMeta parsing library on hold for a while now, but I think its finally time to reinvestigate the idea.

There are actually two reasons for doing this… The first is rather obvious… there is still a vacuum of good parser libraries for Scheme atm. But the other might be less so (uh… unless you read the title)… Scheme Parser Combinators.

After investigating parser combinators for a little while now, a short while ago I had the good fortune to stumble on a post by Gilad Bracha about parser combinators (this time its in Smalltalk).

While not everyone reading this will be able to easily follow the Smalltalk code involved in the post, its possible one of the best descriptions of what a combinator parser is, and how they tend to get written.

My goal here is to replace the forreign syntax of OMeta with a Scheme native syntax using plain old functions and namespace bindings. And of course there’s the odd macro to help with readability (combinators can occasionally get a little out of hand without proper currying, or an unusually light weight syntax for writing lambdas/closures).

What this means is that my parsing solution will now look a bit more like a set of constructors nested between each other being assigned to a variable. Its not quite as pretty as the Haskell variation which uses the parsers directly (thanks to lazy evaluation and currying), but it does work in a reasonably straight forward way.

(define basic-char
  (λ (state)
    (aif rc eof-object? (read-char state)
         fail
         rc)))

(define (char c)
  (λ (state)
    (let ([rc (read-char state)])
      (cond ((eof-object? rc) fail)
            ((char=? rc c) rc)
            (else fail)))))

In the above snipped we define the two fundamental character parsers… They essentially define the naive variation of a character parser and a selective character parser.

What I want you to do is note the second parser’s definition. This parser is not technically a parser at all, and is what we generally refer to as a combinator. In reality its not really a combinator since it doesn’t accept a function as an argument… but it is higher order since it does return one, and that’s they key to making these things work.

All you need to drive a parser combinator is to be capable of returning functions until you have a function with all but one argument… plain and simple currying. Of course because we’re using Scheme and not Haskell, currying for us is not implicit, so rather than doing all of it by hand we’d rather just have combinators which have not been parameterized yet, and actual parsers which only accept a single state argument (a Scheme input port currently).

This does make some of the parser we might implement a little more complicated, since some of our functions will return a parser, and some of our functions will be the parser… but overall it tends to work quite effectively once you understand the difference between the two. It might also help to note that currying is used to prepare functions early, but rather it tends to be on the fly. Occasionally you may even return a parser immediately before running the very same parser (within the same top level expression).

Since I’m looking predominantly at backtracking parsing still lets now investigate a useful practical example to introduce some more combinators and the basic style that you use when writing parser with this stuff…

(define a-or-b
  (many1 (mor (char #a)
              (char #b))))

Ok then… what do we have here. (char #a) and the equivalent for #b are simple parsers which only accept the plain and simple input of their respective character, or they fail.

This might be a good point to explain what the fail value actually is… its not a value for a start. Actually its a clever little MzScheme macro (not R5RS unfortunately), which allows you to either write just fail and it will expand to (make-fail-type ()), or you could write (fail "Did not find character #c!")… which is what we’ll be using when we cover error handling in a later post. Of course there’s also a function for testing it called fail?.

The mor here is of course the or branch (I may come up with a better name for this at some point), which in our case performs any stream rewinding necessary and checks each path until one of them succeeds or the entire construct fails. The parser returned by mor here will actually parse a single #a or a single #b or it will fail.

And finally the many1 accepts a parser which it will run over and over again collecting the results in to a list based accumulator. If it doesn’t see at least one result it will return a fail result, otherwise it will return the list of results. Interestingly this parser that many1 accepts must fail at least once before many1 itself will stop, which means that many1 must always rewind the stream at least a little before returning a successful result.

And of course we use the standard define to set a-or-b…

To finish up this post, our completed parser can be evaluated just like a hand-written parser in Scheme can…

(a-or-b (open-input-string "aabb"))

This will of course return (#a #a #b #b).

Of course this is really just a brief introduction on Parser Combinators and they’re basic operation… in the next post for MMeta I’ll be discussing some actual usage of the parser and of course, attached will be the actual source files for the parser ready to go.

– Lorenz

  1. > (read-char state) Is this function side-effecting? If so, I don’t see how you implement backtracking. I implemented parser combinators in Scheme a while back which you can read about [here](http://www.reddit.com/info/6bls2/comments/). I’ve since improved the code a lot and re-wrote to use CPS instead of multiple value return (speeding up 2-3x and simplifying the code) but I’ve yet to write about it.

  2. Since most parsing operations are not performed in parallel, I believe the side effects shouldn’t be too much of a problem (this is more of an eager evaluation and parser combinators experiment anyway).

    The way I expect it to work in practice is akin to a calling convention. All of these methods ruthlessly cause side-effects (on the stream ONLY). This means that any parser that wishes to have the stream state moved back must save the old stream state and forward a copy of it.

    I haven’t worked out how many side effects will be allowed when user state is added into the mix, but for now there are more important problems to solve first.

    It might be somewhat amusing that a previous incarnation I built did actually use CPS, but I found the notation so cumbersome when implementing the parsers one’s self (which I believe is the real power of this approach to parser building), that I dropped it in favor of properly fast algorithms like predictive parsing (which I’ll get to in a later article).

    edit: Actually… I’d be interested to find out how easy you found working with the CPS version for that matter…

    – Lorenz

  1. There are no trackbacks for this post yet.