Parsers and Combinators

Last time I showed some brief examples of what a parser combinator is, how they work and how I intended to use them to solve the parsing problem using the Scheme programming language. This time we get to start the interesting stuff…

To get things going, last time we looked at a simple character facility which returns a parser that will only accept that particular character. We can also use this technique to select from a range of characters.

(define (char-range start end)
  (λ (state)
    (let ([rc (read-char state)])
      (cond ((eof-object? rc) fail)
            ((and (char>=? rc start)
                  (char<=? rc end))
             rc)
            (else fail)))))

We might also wish to repeat the actions of a parser, which involves writing a proper combinator. Actually, this is the first real utility parser we'll see and it simply allows us to infinitely repeat the parser provided.

(define (many p)
  (letrec ([mp (λ (state acc)
                 (aif result fail? ((with-mark p) state)
                      acc
                      (mp state (cons result acc))))])
    (λ (state)
      (reverse (mp state '())))))

This defines a small function that accumulates a list (backwards as usual) by repeatedly applying the parser (p). The with-mark stuff is the second combinator we'll look at, which is used here to prevent the stream position from moving beyond the last match.

The description of many would be a combinator that continues to apply a parser until it fails, reverting the parser's state to the state prior to running the final failing application of the parser.

The definition of with-mark is even simpler (or should this be with-state?).

(define (with-mark parser)
  (λ (state)
    (let ([mark (file-position state)])
      (aif v fail? (parser state)
           (begin (file-position state mark)
                  v)
           v))))

Our parser state is of course still non-functional, based simply on the port capabilities built into Scheme. We check to make sure the parser doesn't fail, and if it does we reset the parsers state by recording it before hand. Non-functional user state can be stored in a larger closure/environment, and this could be changed to support functional state as a second variable (in a data structure or list), but I'll leave to a later post, or the reader of course.

Running our parser is actually this simple (from the repl).

(define digit
  (char-range # #9))

(define state (open-input-string "12345"))

((many digit) state)

There is one slight problem with this... It accepts any number of digits... including zero digits! I've provided a second version called many1, but you could actually define it like the following snipped. (Note: See the bottom of this post to find the source code as it currently appears).

Ok, we still need a way to parse whitespace, and to abstract the tokens of whatever language we're implementing. To help us here we can write a whitespace parser and a token combinator which will help us.

(define ws-char
  (λ (state)
    (let ([rc (read-char state)])
      (cond ((eof-object? rc) fail)
            ((char-whitespace? rc) rc)
            (else fail)))))

(define whitespace
  (many1 ws-char))

(define (wstok tparser)
  (mand (opt whitespace)
        tparser))

I'd just like to point out before we more on... the ws-char parser here actually recognizes any Scheme whitespace character. I could have written this parser to recognize some other whitespace character set but I felt that this would serve us better for now, since we might eventually want to test our method against the existing Scheme language, or at least a sizable subset.

Again, using this is quite simple... to recognize a number token we can simple define the number token parser as (wstok (many1 digit)).

Now we still have one minor issue here... what if we don't want to simply recognize the language, but also produce some direct result. For example, our parser so far can recognize number tokens, but it doesn't return those numbers... just the lists of characters that make them up! (Without the whitespace of course.)

(define t-number
  (mlet nstr (wstok (many1 digit))
    (! (string->number (list->string nstr)))))

This is where the magic begins... mlet and ! are just macros which expand to scheme let and raw expressions wrapped in lambdas which require the current parser state. mlet then also takes a symbol to bind, and two parsers, which is why we need the ! macro magic. Remember that parsers are just lambdas that take a single parameter... it doesn't matter if they don't use that parameter, or just pass it on (like mlet).

It might also be interesting to note that we could write a combinator that took a function that accepted the results of the parser, and produced a replacement result. Technically this parser is actually very much like the one we just wrote, except that it can only filter the results of a parser, and not the results of several parsers which can only be achieved though nested mlet expressions. Usage might look similar to the following.

(define t-number
  (return nstr (wstok (many1 digit))
    (string->number (list->string nstr))))

Pretty close right.

By this point I thing most of you are starting to see what we're getting to... the canonical example of an expression parser, but this time I'm actually going to add a slight twist to the story instead. We're actually going to generate Scheme code instead, and then use the standard eval function instead. I know this might seem like a strange thing to do, especially since we don't normally use eval in most Scheme code, but think forward a little. Normally you wouldn't directly evaluate the code you just parsed either... think of eval as our interpreter, operating on an abstract syntax tree... conveniently Scheme code is already very much like an AST so we just use that.

(define t-add
  (return _ (wstok-char #+)
    '+))

This gives us our addition token returning, obviously, the Scheme + operator. And finally two more pieces to put it all together.

(define (left-assoc term op)
  (letrec ([f (λ (state acc)
                (aif lst fail? ((with-mark (all op term)) state)
                     acc
                     (f state
                        (list (car lst) acc (cadr lst)))))]) ;; (op lhs rhs)
    (mlet result term
      (λ (state)
        (f state result)))))

(define expr
  (left-assoc t-number t-add))

Finally, to end this... rather lengthy... example. Here is a finished REPL using our parser...

(define (expr-repl)
  (display ">> ")
  (let ([line (read-line)])
    (if (string-ci=? "exit" line)
        '()
        (begin
          (display " = ")
          (display (eval (expr (open-input-string line))))
          (newline)
          (expr-repl)))))

Calling it is pretty simple, and we can get results like so...

> (expr-repl)
>> 1
 = 1
>> 1+2+3+4
 = 10
>> 1 + 2 + 3 + 4
 = 10
>> exit
()

The purpose of this simple example is to show you how our combinators and point out what, in my opinion, is possibly the most important observation to make about parser combinators... all our combinators return parsers and all our parsers are not combinators. Yup, you always have to realize when you're looking at a combinator, and when your looking at a parser. Hell, you could even be looking at a combinator that returns a combinator! (currying et cetera)

Also, in the code I've used above, I made a point of following a coding convention that separates top-level parser definitions by using lambdas, and combinators or simple functions which use the define syntax instead. I found that this makes it easier to visually distinguish between the two, without having to resort to an explicit naming convention.

Next time we'll move beyond this simple stuff and I'll try to investigate more complicated combinators that don't take parsers but data structures to construct their resulting parser.

-- Lorenz

  1. [Scheme] Monadic Parser Combinators – Haskell 風パーサー・コンビネーターの実装…

    Abstract for English readers: This article describes an implementation of a purely functional, monadic parser combinator library in PLT Scheme. With this library, one can easily build non-ambiguous, recursive-descent style parsers for string of charac…