Monday, August 17, 2020

Kotlin: Parser Combinator

 Kotlin: Parser Combinator

A parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.

In a programming language such as Kotlin that has first-class functions, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a production rule of a context-free grammar (CFG) may have one or more alternatives and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s). If a simple parser is available for each of these alternatives, a parser combinator can be used to combine each of these parsers, returning a new parser which can recognise any or all of the alternatives.

This blog describes a parser combinator written in Kotlin. The features of Kotlin made it possible to mirror much of the original Haskell code from which it was developed.



ParsecT

ParsecT is the main parser and the central data type in this library. The class is parametrized and defined as:

class ParsecT<TOK, U, A, Z>

where TOK is the type of elements in the input stream (commonly Char), U is the user state, A is the return type, and Z is the value type when an instance is invoked. The class is represented by a single function type property and the member functions deliver a new instance with a suitably modified function property. Hence the class represents a lazy object which only executes when it is finally invoked.

The most convenient way to define various parsers is to define a custom type synonym for your parser. The library defines a typealias entitled Parser:

typealias Parser<A> = ParsecT<Char, Unit, A, Unit>

where the return type A is still a parameter and the input stream is a sequence of Chars.

The object declaration:

object ParsecTF

includes a set of related functions that either operate on one or more ParsecT parameters or delivers a ParsecT instance.



Object declarations CombinatorF, CharacterF and StringF

The object declaration CombinatorF includes a number of functions used to combine parsers. For example, function many parses none or more occurrences of a given parser. Its signature is:

fun <TOK, U, A, Z> many(parsec: ParsecT<TOK, U, A, Z>): ParsecT<TOK, U, List<A>, Z>

Note that the return type for the generated parser is a List<A> where the given parser returns the type A.

The object declaration CharacterF includes a number of val bindings and function declarations for parsing individual characters. The function declarations in CharacterF are like function oneOf which returns a parser that succeeds if the current input character is in the sequence of character parameters. Its signature is:

fun oneOf(vararg chars: Char): Parser<Char>

The val bindings are predefined character parsers. For example, the parser colon will recognize a ':' character in the input. The parser dollar will recognize a '$' character in the input. The parser digit will recognize a single decimal digit in the input.

The object declaration StringF includes the function parse to execute a parser against some input given as a String. The declaration is:

fun <A> parse(parsec: Parser<A>, name: String, text: String): Either<ParseError, A>

The parsec parameter is the parser to execute, the text is the input the parser operates on and name is used to describe the source. The name might be the name of the file from which the text is obtained.

The return type is either a parse error or the return value from the parser. If the parser fails then the return is a Left instance wrapped around a ParseError. If the parser succeeds then the return is a Right wrapped around the parser return.



The Basics

Most of the val bindings in CharacterF are defined with the member function satisfy:

fun satisfy(predicate: (Char) -> Boolean): Parser<Char>

This looks at the next character from the current input, and if the predicate function returns true for this character, it 'pops' it from the input and returns it. In this way, the current position in the input string is tracked behind the scenes. A sample use of this function is:

val pA: Parser<Char> = satisfy{ch: Char -> (ch == 'A')}
assertEquals(right('A'),  parse(pA, "", "ABC"))
assertTrue(parse(pA, "", "BC").isLeft())
println(parse(pA, "", "BC"))

with the final print producing:

Left(ParseError(sourcePosition=(1, 1), messages=<[SysUnExpect: B]>))

with the error on the first character (of the first line) with the unexpected character 'B'.



CharacterF.character

This function returns a parser that matches the current character in the text that we are parsing to whatever character you provide it. Its signature is:

fun character(ch: Char): Parser<Char>

Sample use of this function is:

val pCharacter: Parser<Char> = character('H')
assertEquals(right('H'),    parse(pCharacter, "", "Hello"))
assertTrue(parse(pCharacter, "", "Goodbye").isLeft())

Here, running character('H') returns a parser that will match the single character 'H'. If we use it in parsing our string, which begins with an H, it is perfectly happy. If we try looking for any letter that isn't 'H', we fail. The result is always of type Either<ParseError, Char>; we get back a Right result if the rule was successful, and Left error if the rule fails.



CharacterF.string

This function returns a parser that attempts to match the given string of characters:

val pString: Parser<String> = string("Hello")
assertEquals(right("Hello"),    parse(pString, "", "Hello world"))
assertTrue(parse(pString, "", "Help").isLeft())

The parser will consume characters from the input one by one until all characters match or one of them is not as expected. Since both of the above attempts begin with 'Hel', the error complains about an unexpected 'p'. This consuming of characters will become significant when multiple rules are chained together.



CharacterF.oneOf

Sometimes we want to match a range of characters; that's where CharacterF.oneOf comes in handy. Similar to CharacterF.character, except you provide this function a list of characters you're OK with matching:

val pVowels: Parser<Char> = oneOf('a', 'e', 'i', 'o', 'u')
assertEquals(right('o'),    parse(pVowels, "", "object"))
assertTrue(parse(pVowels, "", "function").isLeft())

Here, we can see that the parser will consume any single character that is a vowel.

Parsec comes with pre-prepared rules which do just that, for example CharacterF.anyChar which will consume one of anything:

val pSymbols: Parser<Char> = anyCharacter
assertEquals(right('H'),    parse(pSymbols, "", "Hello"))
assertEquals(right('+'),    parse(pSymbols, "", "+-*/"))

The rule CharacterF.letter will happily consume any lower or uppercase letter, CharacterF.lower will consume any lowercase letter, CharacterF.digit will consume any single decimal digit, and CharacterF.alphaNum any letter or digit.



CombinatorF.many

There are times when you want to parse more than just one letter. CombinatorF.many tries to run whatever rule it is given as an argument repeatedly until it fails. Even if the rule matches no times, many will return without error, but just give back an empty result. Let's see how it's used:

assertEquals(right(ListF.of('h', 'h', 'h')),                    parse(many(character('h')), "", "hhheee"))
assertEquals(right(ListF.of()),                                  parse(many(character('e')), "", "hhheee"))
assertEquals(right(ListF.of('h', 'h', 'h', 'e', 'e', 'e')),    parse(many(letter), "", "hhheee"))

As we see,CombinatorF.many can't fail, it will match the rule it's given zero times and just return nothing. It'll go as far as it can and give you back everything that matched. CombinatorF.many1 is similar except that the rule provided must match at least once for it to return successfully.



ParsecT.fmap

The ParsecT data type is a functor by supporting the member function fmap which, given any types A and B lets you apply the function (A) -> B to a ParsecT<TOK, U, A, Z> delivering a ParsecT<TOK, U, B, Z>. Thus we can use fmap to transform the return value of a parser:

val pTab: Parser<Char> = tab
val pSpaces: Parser<String> = pTab.fmap{ch: Char -> "    "}
assertEquals(right("    "),     parse(pSpaces, "", "\t"))
assertTrue(parse(pSpaces, "", "A").isLeft())

Combining fmap, many1 and the single character parser digit we can parse a series of decimal digits into an integer:

val pInteger: Parser<Int> = many1(digit).fmap{digs: List<Char> ->
    digs.foldLeft(0){acc: Int -> {ch: Char -> 10 * acc + (ch.toInt() - '0'.toInt())}}
}
assertEquals(right(123),    parse(pInteger, "", "123"))
assertTrue(parse(pInteger, "", "ABC").isLeft())



Combining rules

The simplest way to combine parsers is to execute them in succession. ParsecT is a monad, and monadic bind is exactly what we use for sequencing our parsers. In the following example we sequence three character parsers to recognize ABC:

val pA: Parser<Char> = character('A')
val pB: Parser<Char> = character('B')
val pC: Parser<Char> = character('C')
val pSequence: Parser<Triple<Char, Char, Char>> =
        pA.bind{chA: Char ->
            pB.bind{chB: Char ->
                pC.bind{chC: Char ->
                    inj(Triple(chA, chB, chC))
                }
            }
        }

assertEquals(right(Triple('A', 'B', 'C')),  parse(pSequence, "", "ABC"))

The object declaration ParsecTF includes the function mapM3 which supports an alternative syntax for sequential execution of monadic parsers:

val pA: Parser<Char> = character('A')
val pB: Parser<Char> = character('B')
val pC: Parser<Char> = character('C')
val pSequence: Parser<Triple<Char, Char, Char>> =
        mapM3(pA, pB, pC){chA: Char -> {chB: Char -> {chC: Char -> Triple(chA, chB, chC)}}}

assertEquals(right(Triple('A', 'B', 'C')),  parse(pSequence, "", "ABC"))



Kotlin package declaration parsing

In this section we develop a simple parser for a Kotlin package declaration. A typical package declaration is:

package com.adt.kotlin.parsec

comprising the package keyword, one or more spaces and a qualified name. First we define the parsers for the keyword and the dot separator:

val pPackageKeyword: Parser<String> = StringF.lexeme(string("package"))
val pDot: Parser<Char> = character('.')

In the original parsec documentation, one of the concepts mentioned is the idea of 'lexeme' parsing. This is a style in which every token parser should also consume and ignore any trailing whitespace. This is a simple convention which with a bit of care allows skipping whitespace exactly once wherever it needs to be skipped. The function lexeme performs this task, skipping any whitespace following the package keyword.

An identifier must start with a letter or underscore, and then followed by zero or more letters, underscores or digits in any combination:

val leading: Parser<Char> = satisfy{ch: Char -> Character.isJavaIdentifierStart(ch)}
val trailing: Parser<Char> = satisfy{ch: Char -> Character.isJavaIdentifierPart(ch)}
val pIdentifier: Parser<String> =
        leading.bind{leadingChar: Char ->
            many(trailing).bind{trailingChars: List<Char> ->
                inj(ListF.cons(leadingChar, trailingChars).charsToString())
            }
        }   // pIdentifier

A qualified identifier comprises one or more identifiers separated by a dot.

val pQualifiedName: Parser<QualifiedName> =
        CombinatorF.sepBy1(pIdentifier, pDot).fmap{names: List<String> -> QualifiedName(names)}

Function sepBy1 takes two arguments, a rule to parse items, and a rule to parse separators and returns a list of whatever the rule returns. Using fmap we convert this into a QualifiedName:

data class QualifiedName(val names: List<String>)

The parser for a package declaration is then:

val pPackageDeclaration: Parser<PackageDeclaration> =
        pPackageKeyword.bind{_: String ->
            pQualifiedName.bind{qualifiedName: QualifiedName ->
                inj(PackageDeclaration(qualifiedName))
            }
        }   // pPackageDeclaration

where:

data class PackageDeclaration(val qualifiedName: QualifiedName)

We demonstrate it with:

assertEquals(right(PackageDeclaration(QualifiedName(ListF.of("test", "kotlin")))),
    parse(pPackageDeclaration, "", "package test.kotlin"))
assertTrue(parse(pPackageDeclaration, "", "package test.*").isLeft())
assertTrue(parse(pPackageDeclaration, "", "package").isLeft())
assertTrue(parse(pPackageDeclaration, "", "import test.kotlin").isLeft())



The code for Parsec can be found at:

https://github.com/KenBarclay/Parsec

No comments: