Part 4: Abstract Syntax Trees
Associated with parsing is the simultaneous construction of an abstract
syntax tree (AST). In this final article we will investigate how this
is achieved with combinator parsers.
When a parse succeeds the success object carries a result value that
represents the recognised element (together with the remaining input
to be parsed). In the case of the letter parser shown in part 1, the
recognised element is the single character:
assert letter('abc###') == ... success with 'a' as the result ...
Equally, the parser:
def letterAndDigit = seqC(letter, digit)
assert letterAndDigit('a123###') == ... success with ['a', '1'] as the result ...
Because we are using a seqC combinator, it returns a list of two elements
produced by the two parsers.
In part 2 we saw the parser for a Groovy identifier:
def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))
where the permissible characters that may start a Groovy identifier are
defined by the parser identifierStart, and all subsequent characters by
the parser identifierRest. The noneOrMoreC combinator will produce a
list of the result characters:
assert identifier('bbc2###') == ... success with ['b', ['b', 'c', '2']] ...
Here, the outer list is produced by seqC and the inner list through noneOrMoreC.
Normally, a lexer would wrap these individual characters into some token
that denotes the completed identifier. Consider that we have such a class:
class Identifier {
def identifier // String
}
The combinator mapC transform the value of a parser p into a different value:
def mapC = { p, f -> ... transform the result of parser p by f ... }
The parameter f is the closure that performs the transformation. Given:
def toIdentifier = { chList -> return new Identifier(identifier: chList[0] + chList[1].join()) }
def identifier = mapC(seqC(identifierStart, noneOrMoreC(identifierRest)), toIdentifier)
assert identifier('bbc2###') == ... success with Identifier object ...
A larger example would be the Groovy while statement:
whileStatement: "while" LPAREN strictContextExpression RPAREN compatibleBodyStatement
The corresponding definition is:
def whileStatement = mapC(seqCs([whileKeyword, lParen, expression, rParen, compatibleBodyStatement]), toWhileStatement)
with the mapping transformer constructing an object of the class WhileStatement:
class WhileStatement {
def expression
def body
}
Further Developments
The combinators described in this series of articles have been stable since the
summer of 2007. They have allowed me to develop a number of simple external DSLs.
Since a BNF grammar defining some language has a fixed structure, then it is
also possible to define combinators that will parse BNF production rules. Here
the BNF rules act as the external DSL. Action code added to the parser can
produce the combinators for the grammar automatically. In effect we have a
simple compiler-compiler.
Work is underway to produce a variant of the current combinators to give
support to error reporting when a parse detects an input error. This has
involved four small classes and relatively minor changes to the combinators.
This new work is presently being tested.
I am also investigating how the combinators scale when defining parsers for
large grammars. Specifically, I am using the combinators to produce parsers
for large subsets of Groovy. Currently I have successfully created parsers for
expressions, statements, declarations, class and interface definitions. I
anticipate being able to parse all the Groovy language, which will then be
used for various tools.
The work, however, is still very much a work-in-progress. For example, results
from developing a parser for a large grammar has revealed opportunities for
optimisations. So expect to see continual enhancements of the library.
The sources for this work will be posted shortly on the Groovy site. Users are welcome
to experiment and use the library. Feedback would be welcomed.
Ken Barclay