Sunday, March 2, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing

Part 2: Combinators

The first article introduced the satisfyC combinator. From it we are able to generate any number of parsers by providing a predicate test closure. For example, we were able to define simple parsers that recognised a letter or a digit in the inputstream.

The GParsec library comprises a number of such combinators. They are summarised in the table below.

satisfyC: The satisfyC combinator consumes a single input when the predicate succeeds.

altC (alt3C, altCs): is the choice combinator. Given two parsers it only looks at its second alternative if the first has not consumed any input - regardless of the final value. The two variants work the same way, alt3C is given three parsers and altCs a list of parsers.

seqC (seq3C, seqCs): is the sequencing combinator. It runs two parsers in succession and if successful, returns the result of the two parsers. Again, seq3C and seqCs are simple variants.

noneOrMoreC, oneOrMoreC: The combinator noneOrMoreC applies a parser zero or more times to an input stream. The result from each application of the parser are returned in a list. Not surprisingly, the combinator oneOrMoreC is used for non-empty sequences.

optionalC: The combinator optionalC may succeed in parsing some input. It always returns success.

Many of these combinators are programmed in a manner similar to combinator satisfyC. Some are defined in terms of the others. For example we might define:

def seq3C = { p, q, r -> return seqC(seqC(p, q), r) }
def oneOrMoreC = { p -> return seqC(p, noneOrMorec(p)) }

These definitions reveal how we might combine the combinators to construct useful parsers. For example, given the definition for the permissible characters that may start a Groovy identifier as identifierStart, and all subsequent characters by the parser identifierRest, then we might define the parser for an identifier as:

def identifierStart = altC(letter, charP('_'))
def identifierRest = altC(identifierStart, digit)
def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))

Other simple parsers are:

def digits = oneOrMoreC(digit)
def integer = digits
def decimal = seqC(digits, seqC(period, digits))

A key characteristic of combinator parsers is they give rise to parser definitions that mimic the production rules of the grammar. For example, a variable definition in Groovy comprises one or more variable declarators separated by commas. A variable declarator is an identifier with an optional initializer. The productions are:

variableDeclarator : identifier ( '=' expression )?
variableDefinitions : variableDeclarator ( ',' variableDeclarator)*

Assuming we have parsers for single characters, identifier and expression, then we define the parsers as:

def variableDeclarator = seqC(identifier, optionalC(seqC(assign, expression)))
def variableDefinitions = seqC(variableDeclarator, noneOrMoreC(seqC(comma, variableDeclarator)))

Observe the one-to-one correspondence between the production rules and the parser definitions.

Problem grammars

We have shown how we define parsers using the range of combinators provided by GParsec. However, some grammars present problems as a result of including ambiguities. This is the subject of our next article.

Ken Barclay


Dierk said...

I guess it would be possible to come up with a nice DSL that allows to come even closer to a declarative description without sacrificing the power of the concept. How about the + operator for seqC or || for alternatives? Optionals could be written as
(0..1) * token (Range times).

a = a + (0..1) * b
c = a ¦¦ (1..many) * b


Ken Barclay said...


Thanks for this. I have already done some experimental work using + for sequencing, | for alternatives and * for many. Probably needs futher refinements but should be possible.