Thursday, November 8, 2007

Combinator Parsers and External DSLs

For the past few months I have been investigating the opportunities and challenges of Domain Specific Languages (DSL) in Groovy. One issue surrounding external DSLs (eDSL) is the need to compile the DSL into code that accesses the application API. This often suggests parser generators such as javaCC or Antlr. If they are unfamiliar to the developer they can present a steep learning curve. They also require that additional code for the DSL has to be developed in addition to the application code.

Another consideration would be to use a parser combinator framework such as Jparsec. This is a classic Java implementation of a combinator parser exhibiting all the usual OO characteristics. It is also a very large library with many classes, many methods, etc.

While seeking a solution to some of these issues I also sought one that would exhibit some of the strengths of Groovy. The approach described in [Combinator Parser] exploits Groovy closures. Each closure defines a simple combinator parser that can be combined into more elaborate parsers. The parser is then transformed with yet another combinator into a compiler [External Domain Specific Languages]. Writing a parser/compiler is then no more complex than writing assignments that reflect the DSL grammar. For example:



def factor = altC(identifier, integer) // factor: identifier integer
def term = sepByC(factor, multiply) // term: factor { '*' factor }*
def expr = sepByC(term, add) // expr: term { '+' term }*
Here, the altC combinator parsers alternatives, while the sepByC parses one or more occurences of the first parser separated by the second parser. The sepByC combinator is, in fact, an assembly of the more primitive combinators seqC (a sequence of two parsers) and noneOrMoreC (none or more occurrences of a parser).

Each combinator is defined as a method that returns a closure (the parser). For example, the satisfyC combinator appears as:


def satisfyC = { pred ->
return { inp -> if(pred(inp[0])) ... success ... else ... fail ... }
}
The closure determines if the first item in the input stream satisfies some predicate (a closure with a single parameter, returning a boolean). Observe how the predicate is a 'free' variable in the closure definition and is set by the call to satisfyC. Examples might include:

def isLetter = { ch -> ... is ch a letter? ... }
def isDigit = { ch -> ... is ch a digit? ... }
def letter = satisfyC(isLetter)
def digit = satisfyC(isDigit)

from which it is easy to define:

def identifier = seqC(letter, noneOrMoreC(altC(letter, digit)))
def integer = oneOrMoreC(digit)

Pleasingly, each combinator is only a few lines of Groovy code [
Combinators.groovy]. The complete set is defined in less than 200 LOC! The combinators are generic insofar as the input stream may be a character sequence (a character parser) or a token stream ( a parser).

Currently, I am working to extend the code with support for error reporting. Additionally, I have defined a combinator parser that reads a BNF definition and generates the combinator parser for that grammar - a kind of combinator parser compiler compiler.

I should be pleased to receive feedback from other developers.

Ken Barclay