Sunday, March 9, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 3: Ambiguous Grammars

The previous article introduced the basic combinators provided by theGParsec library. We showed how they can be combined and how they define parsers that mirror the production rules of the grammar.

We regularly find that our production rules are mutually dependent. For example, with Groovy we have:

whileStatement : 'while' '(' expression ')' compatibleBodyStatement
ifStatement : ...
statement : whileStatement ifStatement ...
blockBody : ( statement separators )*
compoundStatement : '{' blockBody '}'
compatibleBodyStatement : compoundStatement statement





In this extract the compatibleBodyStatement is used to define a whileStatement. The compatibleBodyStatement is either a compoundStatement or a statement, and the latter is any of the permissible statements, including the whileStatement.

To define a Groovy parser for a whileStatement, we would first have to define a parser for compatibleBodyStatement. This would require a parser definition for statement which would, in turn, require whileStatement.

Mutual dependencies are readily handled using the lazyC combinator:


def lazyC = { p ->
return { inp ->
return p[0](inp)
}
}


Here p represents a proxy for the actual parser, held as the first element of a list. Our productions are now defined in Groovy as:


def compatibleBodyStatementProxy = []
def whileStatement = seqCs([whileKeyword, lParen, expression, rParen, lazyC(compatibleBodyStatementProxy)])
def ifStatement = ...
def statement = altCs([whileStatement, ifStatement, ...])
def blockBody = noneOrMoreC(seqC(statement, separators))
def compoundStatement = seq3C(lBrace, blockBody, rBrace)
def compatibleBodyStatement = altC(compoundStatement, statement)
compatibleBodyStatementProxy << compatibleBodyStatementProxy



that can then be used to define a whileStatement. At the end of the code we place the actual compatibleBodyStatement into the proxy.


Ambiguous Grammars

Many grammars are ambiguous and require an arbitrary lookahead to determine the correct alternative in a production rule to follow. For example, Groovy statements can begiven an identifying label. Potentially, that identifier may be a label on a statement or an identifier used on the left of an assignment statement. The Groovy production rule is:


statement : whileStatement ifStatement ... labelledStatement
labelledStatement : identifier ':' statement statement


The tryC combinator behaves like its parser parameter, but pretends its has notconsumed any input when it fails. Consider the parser altC(tryC(p), q). Even when the parser p fails while consuming some input, the choice combinator will try thealternate q. The resulting Groovy definitions are:


def statement = altCs([whileStatement, ifStatement, ..., labelledStatement])

def labelledStatement = altC(tryC(seq3C(identifier, colon, statement)), statement)





Compiling

In the final installment we shall consider how we can adapt our parsers to construct abstract syntax trees to represent the parsed input. We will also look at further developments of this work.

Ken Barclay


No comments: