Thursday, November 8, 2007

Combinator Parsers and External DSLs

For the past few months I have been investigating the opportunities and challenges of Domain Specific Languages (DSL) in Groovy. One issue surrounding external DSLs (eDSL) is the need to compile the DSL into code that accesses the application API. This often suggests parser generators such as javaCC or Antlr. If they are unfamiliar to the developer they can present a steep learning curve. They also require that additional code for the DSL has to be developed in addition to the application code.

Another consideration would be to use a parser combinator framework such as Jparsec. This is a classic Java implementation of a combinator parser exhibiting all the usual OO characteristics. It is also a very large library with many classes, many methods, etc.

While seeking a solution to some of these issues I also sought one that would exhibit some of the strengths of Groovy. The approach described in [Combinator Parser] exploits Groovy closures. Each closure defines a simple combinator parser that can be combined into more elaborate parsers. The parser is then transformed with yet another combinator into a compiler [External Domain Specific Languages]. Writing a parser/compiler is then no more complex than writing assignments that reflect the DSL grammar. For example:



def factor = altC(identifier, integer) // factor: identifier integer
def term = sepByC(factor, multiply) // term: factor { '*' factor }*
def expr = sepByC(term, add) // expr: term { '+' term }*
Here, the altC combinator parsers alternatives, while the sepByC parses one or more occurences of the first parser separated by the second parser. The sepByC combinator is, in fact, an assembly of the more primitive combinators seqC (a sequence of two parsers) and noneOrMoreC (none or more occurrences of a parser).

Each combinator is defined as a method that returns a closure (the parser). For example, the satisfyC combinator appears as:


def satisfyC = { pred ->
return { inp -> if(pred(inp[0])) ... success ... else ... fail ... }
}
The closure determines if the first item in the input stream satisfies some predicate (a closure with a single parameter, returning a boolean). Observe how the predicate is a 'free' variable in the closure definition and is set by the call to satisfyC. Examples might include:

def isLetter = { ch -> ... is ch a letter? ... }
def isDigit = { ch -> ... is ch a digit? ... }
def letter = satisfyC(isLetter)
def digit = satisfyC(isDigit)

from which it is easy to define:

def identifier = seqC(letter, noneOrMoreC(altC(letter, digit)))
def integer = oneOrMoreC(digit)

Pleasingly, each combinator is only a few lines of Groovy code [
Combinators.groovy]. The complete set is defined in less than 200 LOC! The combinators are generic insofar as the input stream may be a character sequence (a character parser) or a token stream ( a parser).

Currently, I am working to extend the code with support for error reporting. Additionally, I have defined a combinator parser that reads a BNF definition and generates the combinator parser for that grammar - a kind of combinator parser compiler compiler.

I should be pleased to receive feedback from other developers.

Ken Barclay

6 comments:

Alex said...

Hi Ken, nice work!
Here are some comments from my side:

Closures are fine, but is there a real benefit from closure usage instead of members of the Combinators class?

As far as I understand,


static satisfyC = { pred -> ...


is equivalent to the


static satisfyC(pred){...} function


Some thoughts on the gparsec -

You've mentioned the powerful jparsec lib, and the gparsec that you've started is like a groovy equivalent of the jparsec, right?
I've just thought of Groovy nature - Groovy can really benefit from re-using Java libs, so probably a Groovy binding to the JParsec can be a good idea, what do you think?

It would be really great to add some syntax sugar to JParsec by introducing some Groovy classs with operators defined that will build JParsec parsers as the result, like that will be perfect:


def group = '(' >> expression >> ')';

def factor = integer | group;

...

where integer is an instance of the parser that accepts integers and has the rightShift operator defined


Thanks,
Alex

Ken Barclay said...

Alex

Thanks for this. Yes, a method is equally valid. My first implementation used methods. I switched to closures, partly to give consistency, but also as objects I can curry them, pass as parameters, put in lists, etc.

I am now working on the implementation to provide better error reporting. It also includes code with overloaded operators for alt(|), seq (+) and (n)oneOrMore (*). This works fine.

Ken

Anonymous said...

Hi Ken

What's the licence for your linked-to GParsec code?

I want to embed it in some Apache 2.0 licensed code I'm writing. Is that a problem?

Thanks, Gavin

Ken Barclay said...

Gavin

There is no licensing associated with gparsec. You are free to use it as-is.

I will shortly issue a new release. This includes the tryC combinator which can be used to handle ambiguous grammars.

Ken

kelheste said...

Mr. Barclay,

I purchased your Groovy Programming book and it made mention to the following website: http://www.dcs.napier.ac.uk/~kab/groovy/groovy.html. The websites were to contain working scripts of examples, case studies and end-of-chapter exercises. The site is no longer valid. Any chance I could obtain this information elsewhere?

kelle.hester@ttu.edu

Yao said...

Hi Barclay,

The link to externalDSL.zip seems broken.
Can I found that file anywhere?