Wednesday, February 27, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 1: Combinator Parsing

This work was inspired by the many recent articles on Domain Specific Languages (DSLs). Most authors when describing DSLs consider only internal DSLs i.e. those embedded in a host language such as Groovy and exploiting the meta-programming facilities. Commonly, external DSLs are given little coverage due to the perceived difficulties. Generally the problems are due to external DSLs requiring input from compiler technology. For most application programmers this is considered a specialised area and requiring a steep learning curve.

Most DSL articles refer the reader to generator parsers such as Lex/Yacc or Antlr. Undoubtedly these are specialist tools that will be unfamiliar to most application developers.

In contrast to parser generators, the functional programming community have developed combinator parsers. They offer a set of combinators to express grammars. These combinators are manipulated as first class values and can be combined to define new combinators for the application. One advantage of combinator parsers is they avoid the integration of differing compiler tools.

The JParsec library offers a Java implementation for a combinator parser. This is a high quality product with some excellent features. However, it is a Java implementation constructed in a classical OO manner. It features a number of classes, some quite complex with a large number of methods.

The GParsec library described in this series of articles demonstrates the construction and application of a combinator parser library based on first principles. Further, in developing the library I showcase some of the excellent features of Groovy.

A Simple Combinator

A common requirement of a parser is to ensure the next part of the input satisfies some grammar rule. For example in Groovy we might expect an identifier to follow the class keyword in a class declaration. The combinator satisfyC is used for this purpose. It is developed as a Groovy closure and has the form:

def satisfyC = { pred ->
return { inp ->
if(inp.size() == 0)
return ... fail ...
else if(pred(inp[0]))
return ... success ...
return ... fail ...

The satisfyC closure accepts a single parameter and returns a closure object.
The returned closure is the required parser. Observe how calling this closure with differing actual parameters will deliver a family of different parsers.

The pred parameter of closure satisfyC is used in the definition of the returned closure. This is sometimes known as a free variable. It is bound to the actual parameter when satisfyC is called and is the reason for the differing parsers that can be delivered by this one closure definition.

The pred parameter of satisfyC is a predicate closure, one that is applied to a single parameter and returns a boolean result. It is used to test the first item in the input stream, denoted by the inp parameter. If the test succeeds then the parser reports success otherwise it reports failure when the test fails or when it encounters the end of the input.

The success and failure of a parse is captured by simple objects. A successful parse will generate and object that has the result of the parse and the remaining input as its state.

We can start to produce simple character parsers with satisfyC:

def isLetter = { ch -> return Character.isLetter(ch) }
def isDigit = { ch -> return Character.isDigit(ch) }
def letter = satisfyC(isLetter)
def digit = satisfyC(isDigit)
assert letter('abc###') == ... success with bc### still in the input stream ...
assert letter('123###') == ... fail ...
assert digit('123###') == ... success with 23### still in the input stream

Where the parse succeeds, then the character that has been recognised is maintained
as the result by the success object. The remainder of the input is also maintained
by the success object. This remaining input is then made available for processing
by another parser.

The GParsec Library

The library comprises a number of such combinators. Many are no more complex than satisfyC. Some are simply combinations of others. The complete library is only some 200 LOC! In the second article in this series we shall consider these combinators and how they are combined to construct useful parsers.

Ken Barclay

No comments: