Sunday, March 16, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 4: Abstract Syntax Trees


Associated with parsing is the simultaneous construction of an abstract
syntax tree (AST). In this final article we will investigate how this
is achieved with combinator parsers.

When a parse succeeds the success object carries a result value that
represents the recognised element (together with the remaining input
to be parsed). In the case of the letter parser shown in part 1, the
recognised element is the single character:

assert letter('abc###') == ... success with 'a' as the result ...

Equally, the parser:

def letterAndDigit = seqC(letter, digit)
assert letterAndDigit('a123###') == ... success with ['a', '1'] as the result ...

Because we are using a seqC combinator, it returns a list of two elements
produced by the two parsers.

In part 2 we saw the parser for a Groovy identifier:

def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))

where the permissible characters that may start a Groovy identifier are
defined by the parser identifierStart, and all subsequent characters by
the parser identifierRest. The noneOrMoreC combinator will produce a
list of the result characters:

assert identifier('bbc2###') == ... success with ['b', ['b', 'c', '2']] ...

Here, the outer list is produced by seqC and the inner list through noneOrMoreC.
Normally, a lexer would wrap these individual characters into some token
that denotes the completed identifier. Consider that we have such a class:

class Identifier {
def identifier // String
}


The combinator mapC transform the value of a parser p into a different value:

def mapC = { p, f -> ... transform the result of parser p by f ... }

The parameter f is the closure that performs the transformation. Given:

def toIdentifier = { chList -> return new Identifier(identifier: chList[0] + chList[1].join()) }
def identifier = mapC(seqC(identifierStart, noneOrMoreC(identifierRest)), toIdentifier)
assert identifier('bbc2###') == ... success with Identifier object ...


A larger example would be the Groovy while statement:

whileStatement: "while" LPAREN strictContextExpression RPAREN compatibleBodyStatement


The corresponding definition is:

def whileStatement = mapC(seqCs([whileKeyword, lParen, expression, rParen, compatibleBodyStatement]), toWhileStatement)


with the mapping transformer constructing an object of the class WhileStatement:

class WhileStatement {
def expression
def body
}






Further Developments

The combinators described in this series of articles have been stable since the
summer of 2007. They have allowed me to develop a number of simple external DSLs.

Since a BNF grammar defining some language has a fixed structure, then it is
also possible to define combinators that will parse BNF production rules. Here
the BNF rules act as the external DSL. Action code added to the parser can
produce the combinators for the grammar automatically. In effect we have a
simple compiler-compiler.

Work is underway to produce a variant of the current combinators to give
support to error reporting when a parse detects an input error. This has
involved four small classes and relatively minor changes to the combinators.
This new work is presently being tested.

I am also investigating how the combinators scale when defining parsers for
large grammars. Specifically, I am using the combinators to produce parsers
for large subsets of Groovy. Currently I have successfully created parsers for
expressions, statements, declarations, class and interface definitions. I
anticipate being able to parse all the Groovy language, which will then be
used for various tools.

The work, however, is still very much a work-in-progress. For example, results
from developing a parser for a large grammar has revealed opportunities for
optimisations. So expect to see continual enhancements of the library.
The sources for this work will be posted shortly on the Groovy site. Users are welcome
to experiment and use the library. Feedback would be welcomed.

Ken Barclay

Sunday, March 9, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 3: Ambiguous Grammars

The previous article introduced the basic combinators provided by theGParsec library. We showed how they can be combined and how they define parsers that mirror the production rules of the grammar.

We regularly find that our production rules are mutually dependent. For example, with Groovy we have:

whileStatement : 'while' '(' expression ')' compatibleBodyStatement
ifStatement : ...
statement : whileStatement ifStatement ...
blockBody : ( statement separators )*
compoundStatement : '{' blockBody '}'
compatibleBodyStatement : compoundStatement statement





In this extract the compatibleBodyStatement is used to define a whileStatement. The compatibleBodyStatement is either a compoundStatement or a statement, and the latter is any of the permissible statements, including the whileStatement.

To define a Groovy parser for a whileStatement, we would first have to define a parser for compatibleBodyStatement. This would require a parser definition for statement which would, in turn, require whileStatement.

Mutual dependencies are readily handled using the lazyC combinator:


def lazyC = { p ->
return { inp ->
return p[0](inp)
}
}


Here p represents a proxy for the actual parser, held as the first element of a list. Our productions are now defined in Groovy as:


def compatibleBodyStatementProxy = []
def whileStatement = seqCs([whileKeyword, lParen, expression, rParen, lazyC(compatibleBodyStatementProxy)])
def ifStatement = ...
def statement = altCs([whileStatement, ifStatement, ...])
def blockBody = noneOrMoreC(seqC(statement, separators))
def compoundStatement = seq3C(lBrace, blockBody, rBrace)
def compatibleBodyStatement = altC(compoundStatement, statement)
compatibleBodyStatementProxy << compatibleBodyStatementProxy



that can then be used to define a whileStatement. At the end of the code we place the actual compatibleBodyStatement into the proxy.


Ambiguous Grammars

Many grammars are ambiguous and require an arbitrary lookahead to determine the correct alternative in a production rule to follow. For example, Groovy statements can begiven an identifying label. Potentially, that identifier may be a label on a statement or an identifier used on the left of an assignment statement. The Groovy production rule is:


statement : whileStatement ifStatement ... labelledStatement
labelledStatement : identifier ':' statement statement


The tryC combinator behaves like its parser parameter, but pretends its has notconsumed any input when it fails. Consider the parser altC(tryC(p), q). Even when the parser p fails while consuming some input, the choice combinator will try thealternate q. The resulting Groovy definitions are:


def statement = altCs([whileStatement, ifStatement, ..., labelledStatement])

def labelledStatement = altC(tryC(seq3C(identifier, colon, statement)), statement)





Compiling

In the final installment we shall consider how we can adapt our parsers to construct abstract syntax trees to represent the parsed input. We will also look at further developments of this work.

Ken Barclay


Sunday, March 2, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing

Part 2: Combinators

The first article introduced the satisfyC combinator. From it we are able to generate any number of parsers by providing a predicate test closure. For example, we were able to define simple parsers that recognised a letter or a digit in the inputstream.

The GParsec library comprises a number of such combinators. They are summarised in the table below.

satisfyC: The satisfyC combinator consumes a single input when the predicate succeeds.

altC (alt3C, altCs): is the choice combinator. Given two parsers it only looks at its second alternative if the first has not consumed any input - regardless of the final value. The two variants work the same way, alt3C is given three parsers and altCs a list of parsers.

seqC (seq3C, seqCs): is the sequencing combinator. It runs two parsers in succession and if successful, returns the result of the two parsers. Again, seq3C and seqCs are simple variants.


noneOrMoreC, oneOrMoreC: The combinator noneOrMoreC applies a parser zero or more times to an input stream. The result from each application of the parser are returned in a list. Not surprisingly, the combinator oneOrMoreC is used for non-empty sequences.

optionalC: The combinator optionalC may succeed in parsing some input. It always returns success.


Many of these combinators are programmed in a manner similar to combinator satisfyC. Some are defined in terms of the others. For example we might define:


def seq3C = { p, q, r -> return seqC(seqC(p, q), r) }
def oneOrMoreC = { p -> return seqC(p, noneOrMorec(p)) }


These definitions reveal how we might combine the combinators to construct useful parsers. For example, given the definition for the permissible characters that may start a Groovy identifier as identifierStart, and all subsequent characters by the parser identifierRest, then we might define the parser for an identifier as:


def identifierStart = altC(letter, charP('_'))
def identifierRest = altC(identifierStart, digit)
def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))


Other simple parsers are:


def digits = oneOrMoreC(digit)
def integer = digits
def decimal = seqC(digits, seqC(period, digits))


A key characteristic of combinator parsers is they give rise to parser definitions that mimic the production rules of the grammar. For example, a variable definition in Groovy comprises one or more variable declarators separated by commas. A variable declarator is an identifier with an optional initializer. The productions are:


variableDeclarator : identifier ( '=' expression )?
variableDefinitions : variableDeclarator ( ',' variableDeclarator)*


Assuming we have parsers for single characters, identifier and expression, then we define the parsers as:


def variableDeclarator = seqC(identifier, optionalC(seqC(assign, expression)))
def variableDefinitions = seqC(variableDeclarator, noneOrMoreC(seqC(comma, variableDeclarator)))


Observe the one-to-one correspondence between the production rules and the parser definitions.


Problem grammars

We have shown how we define parsers using the range of combinators provided by GParsec. However, some grammars present problems as a result of including ambiguities. This is the subject of our next article.

Ken Barclay