A gentle Introduction to Parser Combinators and Angstrom


#1

What is a Parser Combinator?

It is a parsing technique in which you create small simple parsers that can be composed
to create complex parsers. Writing a parser by hand is little bit tedious you need to take
care of many things e.g. backtracking, performance, etc. I had attempted to write a
JSON parser from scratch. I could have done it in a much better way had I known a thing
like parser combinators exist. In this post we’ll look at the Angstrom parser combinator library
and build our own minial JSON parser.

What is Angstrom?

Angstrom is a parser-combinator library that makes it easy to write efficient, expressive,
and reusable parsers suitable for high-performance applications. The parsers provided by
angstrom are backtracking by default and support unbounded lookahead.

Getting Started

During the course of the post we’ll be looking at few parser & combinators that angstrom provides
like char, string, take_while1, sep_by, fix, lift2

We’ll also look at operators like >>=, >>|, *>, <*, <|>

Let’s start with a pesy project

    npm i -g esy pesy@next
    pesy -d json-parser

Add the dependency for Angstrom

    esy add @opam/angstrom

Import the dependency

"buildDirs": {
     "library": {},
     "bin": {
-      "imports": [ "Library = require('json-parser/library')" ],
+      "imports": [
+        "Angstrom = require('@opam/angstrom')",
+        "Library = require('json-parser/library')"
+      ],
       "bin": { "JsonParserApp": "JsonParserApp.re" }
     }
   }

Now that we are done setting up the project lets define how our json type would look like

module JSON = {
  type t =
    | Number(float)
    | Boolean(bool)
    | String(string)
    | Null
    | Object(list((string, t)))
    | Array(list(t));
}

Next we’ll write parsers to parse the primitive types (Number, Boolean, String, Null)

/* Primitive Parsers */
let nullParser  = string("null")  *> return(Null);
let trueParser  = string("true")  *> return(Boolean(true));
let falseParser = string("false") *> return(Boolean(false));
let boolParser  = trueParser <|> falseParser;

Here we use the string parser that Angstrom provides to parse the literals, then we use the *> operator to discard the result of parsing and return one of the appropriate variants

Next we’ll write a parser to parse strings

let stringParser =
    char('"')
    *> take_while1(c => c != '"')
    <* char('"')
    >>| (s => String(s))

Here we use the char(’"’), <*, *> to consume to quotes and we us the take_while1 parser to consume the characters of the string, the result is passed to the function s ⇒ String(s) using the >>| operator which wraps our variant type inside Angstrom.t

You can think of >>| operator as a map operation which changes the data wrapped inside of Angstrom.t

Let’s write the last primitive parser

let numberParser =
    take_while1(
      fun
      | '0' .. '9'
      | '.' => true
      | _ => false,
    )
    >>| (n => Number(float_of_string(n)));

Here we use the similar approach as stringParser we pattern match over the character and consume only the digits and .

Now lets get to the interesting bit, writing the parser for our compound variants

/* helpers */
let comma = char(',');
let colon = char(':');
let keyParser = char('"') *> take_while1(c => c != '"') <* char('"');
let pair = (a, b) => (a, b);

/* putting all things together */
  let parse =
    fix(parse => {
      let arrayParser =
        char('[') *> sep_by(comma, parse) <* char(']') >>| (a => Array(a));

      let member = lift2(pair, keyParser <* colon, parse);
      let objectParser =
        char('{')
        *> sep_by(comma, member)
        <* char('}')
        >>| (o => Object(o));

      peek_char_fail
      >>= (
        c =>
          switch (c) {
          | '"' => stringParser
          | 't'
          | 'f' => boolParser
          | 'n' => nullParser
          | '[' => arrayParser
          | '{' => objectParser
          | _ => numberParser
          }
      );
    });

Woaah! Woaah! that was a lot of code lets try to break it down a bit…

fix is a combinator provided in Angstrom used when parsing recursive grammars,
you can think of it as the rec key-word used while defining recursive functions.

let arrayParser = 
  char('[') *> sep_by(comma, parse) <* char(']') >>| (a => Array(a));

To parse an array we need to consume the opening and closing braces,
then we use the sep_by combinator to split the commas and use the json parser itself
to parse the values of the array (as values itself can be json objects… the grammar is recursive…)

Next lets look at the objectParser

let pair = (a, b) => (a, b);
let member = lift2(pair, keyParser <* colon, parse);
let objectParser =
  char('{') *> sep_by(comma, member) <* char('}') >>| (o => Object(o));

To parse the entries of a json object we define the member parser,
in which we combine 2 parsers,

one to parse the key into a string and the json parser itself to parse the value
using lift2, which take a function to combine the result of both the parsers into a pair
(as in out type definition Object hold list of (string, JSON.t))

Now that we have all the necessary parsers to parse a JSON string,
we need to put it together using

peek_char_fail >>= (
    c =>
      switch (c) {
      | '"' => stringParser
      | 't'
      | 'f' => boolParser
      | 'n' => nullParser
      | '[' => arrayParser
      | '{' => objectParser
      | _ => numberParser
      }
);

We use peek_char_fail for looking at the current character while parsing
and based on the character we decide which parser to call next.

That was it! we just wrote a very minimal json parser using Angstrom,

You can check the code out here: https://github.com/melwyn95/reason-snippets/blob/main/bin/ReasonSnippetsApp.re#L40-L127

Feel free to experiment with the different parsers and combinators provided by Angstrom,
Also you can checkout the full blown JSON and HTTP parsers in the Angstrom examples


Advent of Code 2020 - Starter kit and tips
#2

Thanks. Great tutorial. If I could offer one piece of feedback, I think it would be better to use the let-operators included in the Angstrom module instead of using the the specialized combinators like *>, <*, liftN. The reason being that the let-operators show a general ‘control flow’ in an ‘imperative’ style which makes it easier to follow like ‘first this gets parsed, then that gets parsed’, while of course in reality the parsing is applicative and thus happening in ‘parallel’. I think it makes it easier also to have to learn less of the combinators.

E.g.:

let nullParser = {
  let+ _ = string("null");
  Null;
};
...
let digits = take_while1(fun '0'..'9' => true | _ => false);

let numberParser = {
  let+ whole = digits;
  and+ _ = option('.', char('.'));
  and+ frac = option("0", digits);
  Number(float_of_string(whole ++ "." ++ frac));
};
...
let member = {
  let+ k = keyParser;
  and+ _ = colon;
  and+ v = parse;
  pair(k, v);
};

#3

Hey @yawaramin, Thanks for the feedback, I had not used let operators before so it was a nice learning experience,
I wrote the JSON parsing example using let operators (https://github.com/melwyn95/reason-snippets/blob/main/library/JsonOperators.re)
I agree they allow you to reason about the control flow in a much better way
Thanks again for the tip