What is a Parser Combinator?
It is a parsing technique in which you create small simple parsers that can be composed
to create complex parsers. Writing a parser by hand is little bit tedious you need to take
care of many things e.g. backtracking, performance, etc. I had attempted to write a
JSON parser from scratch. I could have done it in a much better way had I known a thing
like parser combinators exist. In this post we’ll look at the Angstrom parser combinator library
and build our own minial JSON parser.
What is Angstrom?
Angstrom is a parser-combinator library that makes it easy to write efficient, expressive,
and reusable parsers suitable for high-performance applications. The parsers provided by
angstrom are backtracking by default and support unbounded lookahead.
Getting Started
During the course of the post we’ll be looking at few parser & combinators that angstrom provides
like char, string, take_while1, sep_by, fix, lift2
We’ll also look at operators like >>=, >>|, *>, <*, <|>
Let’s start with a pesy project
npm i -g esy pesy@next
pesy -d json-parser
Add the dependency for Angstrom
esy add @opam/angstrom
Import the dependency
"buildDirs": {
"library": {},
"bin": {
- "imports": [ "Library = require('json-parser/library')" ],
+ "imports": [
+ "Angstrom = require('@opam/angstrom')",
+ "Library = require('json-parser/library')"
+ ],
"bin": { "JsonParserApp": "JsonParserApp.re" }
}
}
Now that we are done setting up the project lets define how our json type would look like
module JSON = {
type t =
| Number(float)
| Boolean(bool)
| String(string)
| Null
| Object(list((string, t)))
| Array(list(t));
}
Next we’ll write parsers to parse the primitive types (Number, Boolean, String, Null)
/* Primitive Parsers */
let nullParser = string("null") *> return(Null);
let trueParser = string("true") *> return(Boolean(true));
let falseParser = string("false") *> return(Boolean(false));
let boolParser = trueParser <|> falseParser;
Here we use the string parser that Angstrom provides to parse the literals, then we use the *> operator to discard the result of parsing and return one of the appropriate variants
Next we’ll write a parser to parse strings
let stringParser =
char('"')
*> take_while1(c => c != '"')
<* char('"')
>>| (s => String(s))
Here we use the char(’"’), <*, *> to consume to quotes and we us the take_while1 parser to consume the characters of the string, the result is passed to the function s ⇒ String(s) using the >>| operator which wraps our variant type inside Angstrom.t
You can think of >>| operator as a map operation which changes the data wrapped inside of Angstrom.t
Let’s write the last primitive parser
let numberParser =
take_while1(
fun
| '0' .. '9'
| '.' => true
| _ => false,
)
>>| (n => Number(float_of_string(n)));
Here we use the similar approach as stringParser we pattern match over the character and consume only the digits and .
Now lets get to the interesting bit, writing the parser for our compound variants
/* helpers */
let comma = char(',');
let colon = char(':');
let keyParser = char('"') *> take_while1(c => c != '"') <* char('"');
let pair = (a, b) => (a, b);
/* putting all things together */
let parse =
fix(parse => {
let arrayParser =
char('[') *> sep_by(comma, parse) <* char(']') >>| (a => Array(a));
let member = lift2(pair, keyParser <* colon, parse);
let objectParser =
char('{')
*> sep_by(comma, member)
<* char('}')
>>| (o => Object(o));
peek_char_fail
>>= (
c =>
switch (c) {
| '"' => stringParser
| 't'
| 'f' => boolParser
| 'n' => nullParser
| '[' => arrayParser
| '{' => objectParser
| _ => numberParser
}
);
});
Woaah! Woaah! that was a lot of code lets try to break it down a bit…
fix is a combinator provided in Angstrom used when parsing recursive grammars,
you can think of it as the rec key-word used while defining recursive functions.
let arrayParser =
char('[') *> sep_by(comma, parse) <* char(']') >>| (a => Array(a));
To parse an array we need to consume the opening and closing braces,
then we use the sep_by combinator to split the commas and use the json parser itself
to parse the values of the array (as values itself can be json objects… the grammar is recursive…)
Next lets look at the objectParser
let pair = (a, b) => (a, b);
let member = lift2(pair, keyParser <* colon, parse);
let objectParser =
char('{') *> sep_by(comma, member) <* char('}') >>| (o => Object(o));
To parse the entries of a json object we define the member parser,
in which we combine 2 parsers,
one to parse the key into a string and the json parser itself to parse the value
using lift2, which take a function to combine the result of both the parsers into a pair
(as in out type definition Object hold list of (string, JSON.t))
Now that we have all the necessary parsers to parse a JSON string,
we need to put it together using
peek_char_fail >>= (
c =>
switch (c) {
| '"' => stringParser
| 't'
| 'f' => boolParser
| 'n' => nullParser
| '[' => arrayParser
| '{' => objectParser
| _ => numberParser
}
);
We use peek_char_fail for looking at the current character while parsing
and based on the character we decide which parser to call next.
That was it! we just wrote a very minimal json parser using Angstrom,
You can check the code out here: https://github.com/melwyn95/reason-snippets/blob/main/bin/ReasonSnippetsApp.re#L40-L127
Feel free to experiment with the different parsers and combinators provided by Angstrom,
Also you can checkout the full blown JSON and HTTP parsers in the Angstrom examples