Parsers are everywhere. Writing a program? You’re using a parser. Opening a PDF or an image? You’re using a parser. Using the internet? Parsers read the packets that go back and forth on the network.

For something so ubiquitous, parsing is a little understood topic. What do parsers do? How do they work? Can I write one?

This is a short post on how you can build a really simple binary string parser using nom. Armed with this knowledge, go forth and parse the world!

Parsing 101

Parsers read input and transform it into another more structured form. A parser can turn a binary image file into a form that your computer can display on the screen. It can also turn a Python script into an abstract syntax tree (AST) that the Python interpreter can use to run your program.

We’re going to build a simple top-down recursive descent parser for binary strings like 00101. All our parser will do is succeed when the input string is valid.

There are two key terms here that influence the design of our parser: top-down and recursive descent.

Top-down means that the parser starts from the top - in our case this means parsing a binary string first and working down to the smallest token (0 or 1). Compare this to a bottom-up parser that would first parse 0 or 1, then work its way up to a whole binary string.

Recursive descent is a technique for writing parsers where recursive functions map (mostly) to the grammar of the language being parsed. Parsing binary strings with these is really simple.

<bin_digit> ::= '0' | '1'
<bin_str> ::= <bin_digit> <bin_str> | <bin_digit>

The grammar above (in Backus-Naur form) defines a bin_digit nonterminal as 0 or 1. It then defines a bin_str nonterminal as either a bin_digit followed by another bin_str, or a bin_digit on its own. In this way we recursively use the bin_str nonterminal to parse any length string.

To implement this parser, we can define two pseudocode functions:

function bin_digit {
	// read '0', read '1', or error
}

function bin_str {
	// call bin_digit, or error
	// call bin_str until error
}

To hammer this concept home, let’s use the above parser to parse the string 101.

  1. Starting with bin_str (remember: top-down!), we call bin_digit and parse 1 successfully.
  2. We call bin_str recursively, and this time parse 0.
  3. In a second recursive call, we parse 1. Then call bin_str again.
  4. This time there’s nothing left to parse, so bin_digit errors. This causes bin_str to stop recursing and we wind up having parsed the whole string: 101.

Building parsers in this way makes the code more legible, as functions roughly map to the grammar. It also helps us take advantage of parser combinators.

Introducing: nom

Parser combinators are the building blocks of a parser. On their own, they parse simple things like characters, digits, and whitespace. They really shine when combined, as they can be used to build more complex parsers.

nom is a Rust crate that provides a set of parser combinators. Using nom, we can easily build our own parsing functions.

In a binary string, we have zeroes and ones. We can easily build combinators for these using nom’s char combinator:

fn zero(input: &str) -> IResult<&str, char> {
    char('0')(input) // '0'
}

fn one(input: &str) -> IResult<&str, char> {
    char('1')(input) // '1'
}

Simple enough. These functions take an input string, apply nom’s char combinator to it and return an IResult containing both the remainder of the input string and the char we’ve parsed. IResult can also contain an error.

Here’s where it gets fun. Now we can build the function for our bin_digit nonterminal, using nom’s alt combinator to parse either our zero or our one combinator.

fn bin_digit(input: &str) -> IResult<&str, char> {
	alt((zero, one))(input) // either '0' or '1'
}

Finally, we can build our bin_str function. Here we use the fold_many0 combinator which applies the bin_digit parser multiple times. It calls our supplied closure each time, which concatenates the parsed character into an accumulator. The accumulator is initialised with an empty string and eventually contains our result: a valid binary string.

fn bin_str(input: &str) -> IResult<&str, String> {
    fold_many0( // fold many 'bin_digit' into a single string until we get an error
        bin_digit,
        move || String::new(),
        |acc, c| format!("{}{}", acc, c),
    )(input)
}

That went from 0 to 100 pretty quickly. Parsing can get incredibly complex and this post is intended to be a super-simple introduction the topic. My advice would be to take the time to digest this practical example and give it a go yourself.

There are plenty more resources out there if you want to go deeper. Go ahead and throw yourself into Crafting Interpreters for an excellent guide on writing your own language!