Grammars for arbitrary bitstrings

Note: this was an older direction that I might return to.

Grammars are defined on a string, but that “string” is usually a subset of all possible strings; e.g. it is usually a string of “human-readable characters”. This bias is due to the fact that grammars and parsing started with languages like English and Fortran. Instead, what we want is a way to write a precise grammar for any language, not just languages expressed as human-readable characters. Also, we want grammars, because they can both be used as recognizers and generators, and potentially recognize/parse files in linear time.

To avoid encoding shortcuts (and thus limiting the expressiveness of our grammars), we want to define grammars on bitstrings. Thus, we only need two terminals, 0 and 1. We’ll build up some common primitives, but they are for convenience only.

However, it’s easy to define grammars as being on bitstrings, albeit a little more cumbersome. Our alphabet is just the two 1-bit values 0 and 1. We’re going to start with the syntax of ABNF, as clumsy as that is (although we’ll abandon it soon, since we’ll want the freedom to declare what a character in a string actually means). We might define a 1-byte value and a 4-byte value like this:

OCTET = 8*8(0|1)
DWORD = 4*4OCTET

Note that in neither definition do we require that an OCTET or DWORD starts on an 8-bit bounds. Padding has to be explicit in our grammars.

But right off the bat, we have issues with bit-endianness and byte-endianness once we want to interpret values (which we need to do in order to make an actual language). Yes, an OCTET is 8 consecutive bits from the string, but if you want to reason about it as a value in the range 0..255, you would need to decide if the subset bitstring is big-endian (most significant bit is leftmost in the bitstring) or little-endian. And likewise, a DWORD is 4 consecutive OCTETs, but are those organized in a big-endian fashion or little-endian fashion.

We could tediously construct our values (with semantic actions)

b0 = (0|1)
b1 = (0|1)
b2 = (0|1)
b3 = (0|1)
b4 = (0|1)
b5 = (0|1)
b6 = (0|1)
b7 = (0|1)
OCTECT_BE = b7 b6 b5 b4 b3 b2 b1 b0 { $$ = ($1 << 7) | ... ($8); }

Or something like that. It’s likely that we will quickly fall back on some predefined primitives, but it’s nice to know that we can define all the primitives ourselves. We might even want to steal the “little” and “big” prefixes from DataScript, so we might simply have

. = (0|1)
OCTET = big ........

and then the nonterminal’s value is converted as a big-endian bitstring.

Another challenge is that grammars involve a left-to-right parse, but many files’ interpretation involves moving about in the bitstring. A sequence of bits early on might be an offset to a different section in the bitstring that needs to be understood before more bits can be parsed.

Appendix

Why not ASN.1? Because it’s not suitable. It’s complex, and yet incomplete. It separates abstract notation from encoding, which we don’t want (we want to keep them together), and it can’t handle random-access formats.

Why not BFF (“A Grammar for Binary File Formats”)? It limits itself to a bytestring, not a bitstring.

Something intriguing: Synalyze It!. (sigh, people who put exclamation marks in names). In looking at the “grammar” for ZIP files, it falls back to a fair amount of LUA. It’s really a decompiler, not a grammar; I’m not seeing how this can be used to generate strings (e.g. legal ZIP files). But still, it’s interesting.

Another interesting commercial candidate is 010 Editor, with the brag “Edit Anything”. It’s an interesting approach, use C structs with embedded semantic actions. Again, not a grammar, not generative. It’s really a specialized language for parsing binary data.

Links

Parsing of Context Sensitive Grammars

Earley Parsing for Context-Sensitive Grammars

Sample Parsers from Parsing Techniques

The Convergence of Mildly Context-Sensitive Grammar Formalisms

Genetic programming with context-sensitive grammars

A new method for dependent parsing (Yakker)

Van Wijngaarden grammar

xtc (eXTensible Compiler) project.

PacketTypes: Abstract Specification of Network Protocol Messages

 

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>