Recognize This!
Ok… I may not have entirely explained the machinery behind the basic backtracking parser combinators yet, but before I get to far into the heavy stuff I thought it might be useful to discuss some basic context issues involved in writing parsers.
There are two main elements involved in writing parsers… The Why… and the How (to some extent what and when also fit here… these are technical aspect that is).
Why do we write parsers?
There are actually a few reasons, and this isn’t really a particularly interesting question for us (after all, we’re already writing a parsers aren’t we?), but I would like to briefly recap what I believe are important points to keep in mind when designing a parser combinator framework.
There are two main varieties of parsers, which come in the form of prototypes and production parsers. Regardless of what the parser is being used for, the purpose of the parser is either to support another process, or it is simply to test some linguistic nature of a computer language (concrete syntax typically).
Of these two categories, we also stumble over a hybrid variation, where a parser written for production use may end up serving a prototype role and, of course, prototype parsers may end up evolving into production parsers.
The importance of this categorization is the implications of generally knee-jerk decisions associated with each approach. The main decisions we need to worry about ourselves when designing a parser framework are of course the error reporting and recovery mechanisms, basic performance concerns, and to some extent the language and tool support.
This isn’t just limited to our intended audience either, but also to ourselves as parser implementors… its all to common to consider why we are writing a parser and select a methodology based purely on practical concerns, and yet, quite often the very notation we use to write the parser is more important. The notation can have very severe effects on the cognitive processes involved in writing a parser or computer language, and so too can the tools that support that notation.
How do we implement parsers?
This part is a little simpler to describe at a high level… that and we’ll be delving into the issues more closely later on anyway.
I personally like to break the implementation concerns into three major categories… recognition, state, and output.
Each of these can become extraordinarily complex and intertwined, but they always tend to follow the same basic trade-offs which we’ll discuss briefly here.
Recognizing input is possibly the most obvious element of parser construction, since it tends to follow the design of the language anyway. More specifically it tends to follow the formal definition of a language, and most conventional notations use the same form as the BNF or PEG formal notation.
While it might seem useful to apply the same notation as the formal definition of a language, imo its not particularly friendly in practice. The main issue with this trade off is the difficulty in implementing non-trivial concerns like left-factoring, lookahead, error reporting/recovery or simply the fact that they don’t contain any proper form of abstraction (no state information or context sensitivity).
This the actually the reason we’re using a combinator approach… it allows us to capture the conceptual elements involved rather than directly writing a whole lot of repetitive rules, simply because the parser generator tool we selected, didn’t realize that scanners aren’t the only type of repetitive task in language writing (expression/operator grammars for example)… or even the lowly delimited list based on non-conventional parameter parsers.
This might actually be a good time to point out the ANTLR parser generator as an example of the kind of trade of we’re trying to avoid. I can’t stress enough how often I’ve wished I could refactor part of a parser involving delimited lists, enclosed parsers, associativity and precedence rules, and it just gets worse from there. The problem? No higher-level abstraction of rules (its like programming a user interface with 1st order logic… well… maybe not quite that bad but you get the idea).
Collecting state is the second important consideration when parsing a computer language. It might be interesting to consider what kind of state you’re accumulation however. In a calculator for example, the state is actually the current number and possibly a bindings table for variables if you allow them, but for a C parser you also need the symbol table and possibly even some other annotations for the preprocessor phase.
The point I’m trying to make here is not that parsers must return something… we’ll get to that in a second… the point is that many parser have other state that is needed by the other software components that receive the output form. This state obviously cannot be collected directly as a result, and yet it also has to be collected directly as a result of parsing.
Why do I care so much about these side effects? Its a perfect opportunity to reflect on Parsec’s usage of monads… and why we don’t need them!
Yup… that’s right, monads primarily capture a few things, but the main elements are timely evaluation (not necessarily eager), imperative ordering of evaluation, and propagation of out-of-band state information. To paraphrase… side effects and imperative code blocks. (Monads are actually capable of quite a bit more than this, but in our case, none of that behaviour is relevant.)
Actually we do still need to pass state, but whether we do it globally (uh…), or through an accumulator doesn’t matter… we have a different way of writing side effects, and we’ll use that for them. Scheme also provides the begin notation for us, so we have a perfectly viable means of writing imperative code if we need it, but even in Parsec it’s rarely a necessity.
Finally we come to the Output form, which is almost exclusively broken down into Abstract Syntax Trees (ASTs), Compilers (intermediate or machine code output), Simple Recognizers, Translators (Scheme to C for example), and Interpreters. Of these only ASTs are used when we have reasonable levels of computation power available, while the other forms are typically relegated to AST parsers or in our case, the native pattern matching and program code of the host language.
The big challenge we’re facing here however is actually related somewhat to the difference between prototype and production parsers. If we wish to provide good support for prototyping, we have to realize that its not entirely uncommon for interpreters to be written without an accompanying AST representation. Unless we can make the AST representation so inexpensive to generate and extraordinarily useful, we may wish to provide support for both ASTs and interpreters.
Whatever trade-off we make however, as long as we realize that we need to effectively support all of the possible applications unless we expect programmers to also use another parser framework or generator. Not necessarily perfectly, but at least adequately (we need to at least make generating ASTs cheap so that programmers can work with them even if the other forms are not simple to directly implement).
Before I finish I’d also like to add a small note about how parsers are actually used… or more specifically that the tools we provide for testing, debugging and maintaining our parsers are possibly more important than the ability to write the parsers in the first place.
For those who have used ANTLR before, you probably realize what I’m talking about here… yes the ANTLR grammars are limited, and painful to effectively refactor, when half the time you’re only even generating ASTs anyway. But the grammar debuggers and generally the ANTLR Works IDE are exceptional tools for constructing parsers. Also… I don’t mean to pay out the ANTLR parser generator… it really is the best tool I’ve used with Java or C++… its just not as powerful as I believe we can get in Scheme
While we can simplify the construction of most parsers greatly, it would be even better if we could also provide dedicated tooling for profiling, debugging, inspecting or even just testing our parsers. To this end, I’ll be adding various tooling issues interspersed with the construction of various parser facilities as we construct hopefully the canonical Scheme parser framework… or at least make the framework as useful as possible.
I don’t intend on covering any of this again in future posts in the series. Instead we’ll only be investigating the trade offs relevant for a Scheme parser combinator framework
– Lorenz