How to parse zone files

For the past few weeks I've been writing a DNS server in Go. Currently I'm working on zone file parsing, and I've noticed that the standards are pretty unclear on how exactly zone files should be parsed. I think I've figured it out though, so this article will explain exactly how to parse them.

DNS zone files are described in RFC 1035 section 5, "Master Files". The term "master file" is so incredibly vague, though, so for the rest of this article I'll be calling them "zone files". Zone files contain a set of "records", which a DNS server can provide to a client when asked. Each record consists of five pieces of information:

The domain name
The time to live (AKA TTL, used for caching)
The resource's class (this is almost always IN, for "Internet")
The resource's type
The resource's data

In the standard, the domain name can be replaced by a <blank>, which is not a standard ABNF token, although it's fine because the DNS was standardized before ABNF was. This really just means that the domain name MUST occur at the beginning of a line, and if it doesn't, then it's assumed to be the same as the previous record in the file.

This creates some interesting problems for parsers. I originally tried parsing with goyacc, but this and various other problems we'll get into made it very difficult to write a valid tokenizer. It's very difficult to keep track of whitespace at the beginning of a line but ignore it everywhere else. It's possible with some specially written yacc rules that merge whitespace with a previous token, but it makes the code a lot messier.

The TTL and resource class are also optional, but the type isn't. Every record in our zone file MUST have a type. This leads to even more problems with the tokenizer. Consider, for example, this zone file:

@ IN MX MX  ; line 1
@ IN MX IN  ; line 2

In line 1, how can we use yacc to distinguish the "MX" (resource type) from "MX" (resource value)? And in line 2, how can we use yacc to distinguish the "IN" (resource class) from "IN" (resource value)? I'm not a yacc expert, but I'm pretty sure that with a simple lex+yacc configuration this is literally impossible. yacc isn't smart enough to conditionally convert a token generated by lex, and lex isn't smart enough to conditionally scan a token based on the current yacc state.

We have to parse these files manually. This actually isn't too bad. Each resource can be thought of as a series of "tokens". Our example zone file, for example, could be rewritten as

["@", "IN", "MX", "MX"]
["@", "IN", "MX", "IN"]

Thankfully, the standard gives us some pretty simple rules for parsing these tokens. Any character can be escaped by a backslash, digits are escaped in groups of three (as in "\065"), parentheses allow for the tokens within a record to span multiple lines, and semicolons indicate comments. We can quite easily write a tokenizer by hand with these rules to generate these resource records as lists of strings. In my DNS server, the tokenizer doesn't actually process the things that it's escaping, it just reads the characters and leaves escape code handling for the parser.

In the parser, we can rely on the fact that every record MUST have a resource type, and that there is a known list of valid resource types. We just go through each token in our record, find the first valid resource type, assume that everything before then is a part of the domain name, class, and TTL, and assume that everything after that is a part of the resource data.

The standard makes no mention of these special tokens or this algorithm, this is just something I made up because it seemed close enough to the standard and I couldn't find any decent counterexamples. I feel like there could be a security vulnerability here, similar to the Python url parsing vulnerability discovered last year, where my algorithm is slightly different to everybody else's which allows attackers to slip a malicious payload by my checker, but I'm too lazy to figure out what it is.