Knit, component definition and linking

A good paper on this topic:

Knit: Component Composition for Systems Software

I want something better than what is currently available, at least to people writing system software or “shrinkwrap” desktop software. The problems with libraries and object files are numerous. Some of the things I’ve struggled with:

I want to override something in a library. For example, and for good reasons, I want to replace the global operator new in C++ with my own function; there is no way to override it in a sane and error-free fashion. There are some tricks that work, but they are tricks. And the same goes for any function: I really do want to replace malloc, because the ones given to me are crap.
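For what it’s worth, the usual trick looks something like the sketch below: the C++ standard lets a program replace the global allocation functions by defining them in exactly one translation unit, but nothing verifies that the replacement set is complete or consistent (the array, nothrow, and aligned overloads, and the matching deletes, all need their own definitions), which is exactly the fragility complained about above.

#include <cstddef>
#include <cstdlib>
#include <new>

// Replacement global operator new: defining it once silently replaces the
// version in the runtime library for the entire program.
void* operator new(std::size_t size) {
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc();
}

// The matching operator delete has to be replaced as well.
void operator delete(void* p) noexcept {
    std::free(p);
}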

Get out of my namespace! Namespaces are global, and namespace pollution becomes an ever-bigger problem as programs grow.

Ten versions of Zlib. Many libraries link against specific versions of other libraries. When you compose them all into one program, the likely result is bloat at the least, and more often broken behavior, because library revisions change interfaces and add or remove bugs.

Configuration is a nightmare. I have to physically locate specific versions of libraries, which means either building them myself or revisiting my builds every time I add a library or change a library’s version.

Library boundaries are very granular, meaning a significant fraction of the code is spent building up and tearing down the parameters passed across them.

I want to insert stuff in the middle. If library A calls library C, but I want it to call library B, which in turn calls library C, I can’t do that. You could design library A to allow such interposition, but then it’s slower and more complicated to use, especially when interposition is a minority use case. A sketch of that workaround follows.
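The manual workaround is to make A depend on an abstract interface instead of calling C directly, so that a wrapper can be slotted in when the program is put together. Here is a hedged sketch in C++ (all names are hypothetical, and this is not Knit’s mechanism, just the design burden a component system is meant to remove):

#include <cstdio>

// The interface that library A is written against.
struct Storage {
    virtual void write(const char* data) = 0;
    virtual ~Storage() = default;
};

// Library C: the real implementation.
struct FileStorage : Storage {
    void write(const char* data) override { std::fputs(data, stdout); }
};

// Library B: an interposer that wraps whatever Storage it is handed.
struct LoggingStorage : Storage {
    Storage& next;
    explicit LoggingStorage(Storage& n) : next(n) {}
    void write(const char* data) override {
        std::fputs("write: ", stderr);   // the extra behavior inserted in the middle
        std::fputs(data, stderr);
        next.write(data);                // forward to the wrapped implementation
    }
};

The appeal of a component system like Knit is that this wiring becomes a property of the composition, so A does not have to anticipate it.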

 

Grammars for arbitrary bitstrings

Note: this was an older direction that I might return to.

Grammars are defined over strings, but the “strings” considered are usually a restricted subset of all possible strings; e.g. they are usually strings of human-readable characters. This bias comes from the fact that grammars and parsing grew up with languages like English and Fortran. What we want instead is a way to write a precise grammar for any language, not just languages expressed as human-readable characters. And we want grammars specifically, because they can be used both as recognizers and as generators, and can potentially recognize/parse files in linear time.

To avoid encoding shortcuts (which would limit the expressiveness of our grammars), we define grammars on bitstrings. Thus we need only two terminals, 0 and 1. We’ll build up some common primitives, but they are for convenience only.

It’s easy enough to define grammars on bitstrings, if a little more cumbersome. Our alphabet is just the two 1-bit values 0 and 1. We’re going to start with the syntax of ABNF, as clumsy as that is (although we’ll abandon it soon, since we’ll want the freedom to declare what a character in a string actually means). We might define a 1-byte value and a 4-byte value like this:

OCTET = 8*8(0|1)
DWORD = 4*4OCTET

Note that neither definition requires that an OCTET or a DWORD start on an 8-bit boundary. Padding has to be explicit in our grammars.

But right off the bat, we have issues with bit-endianness and byte-endianness once we want to interpret values (which we need to do in order to make an actual language). Yes, an OCTET is 8 consecutive bits from the string, but if you want to reason about it as a value in the range 0..255, you have to decide whether that sub-bitstring is big-endian (most significant bit leftmost) or little-endian. Likewise, a DWORD is 4 consecutive OCTETs, but are those organized in big-endian or little-endian fashion?
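To make the ambiguity concrete, here is a small illustration of my own (not part of the grammar machinery): the same 8-bit substring 0 0 0 0 0 1 0 1 is 5 if the first bit is taken as most significant and 160 if it is taken as least significant.

#include <cstdio>

int main() {
    const int bits[8] = {0, 0, 0, 0, 0, 1, 0, 1};   // leftmost bit of the substring first
    unsigned be = 0, le = 0;
    for (int i = 0; i < 8; ++i) {
        be = (be << 1) | bits[i];       // big-endian: first bit is most significant
        le |= bits[i] << i;             // little-endian: first bit is least significant
    }
    std::printf("big-endian: %u  little-endian: %u\n", be, le);   // prints 5 and 160
}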

We could tediously construct our values (with semantic actions):

b0 = (0|1)
b1 = (0|1)
b2 = (0|1)
b3 = (0|1)
b4 = (0|1)
b5 = (0|1)
b6 = (0|1)
b7 = (0|1)
OCTET_BE = b7 b6 b5 b4 b3 b2 b1 b0 { $$ = ($1 << 7) | ($2 << 6) | ($3 << 5) | ($4 << 4) | ($5 << 3) | ($6 << 2) | ($7 << 1) | $8; }

Or something like that. It’s likely that we will quickly fall back on some predefined primitives, but it’s nice to know that we can define all the primitives ourselves. We might even want to steal the “little” and “big” prefixes from DataScript, so we might simply have

. = (0|1)
OCTET = big ........

and then the nonterminal’s value is the matched bits interpreted as a big-endian number.
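As a sketch of what a parser might do for the big prefix (my own illustration, not DataScript’s actual semantics), the semantic value is just a fold over the matched bits, most significant first; the same helper covers an OCTET, a DWORD, or any other width. A little prefix would fold in the opposite direction, with the first-matched bit least significant.

#include <cstdint>
#include <vector>

// Interpret a matched bit sequence as a big-endian unsigned value.
// bits[0] is the leftmost (first-matched) bit of the substring.
uint64_t big_value(const std::vector<int>& bits) {
    uint64_t v = 0;
    for (int b : bits)
        v = (v << 1) | static_cast<uint64_t>(b);
    return v;
}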

Another challenge is that grammars assume a left-to-right parse, but interpreting many file formats involves moving about in the bitstring. A sequence of bits early on might be an offset to a different section of the bitstring that has to be understood before more bits can be parsed.
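As a hypothetical illustration of why that breaks a strictly left-to-right parse: suppose a format stores, in its first four bytes, a little-endian offset to a directory somewhere later in the file (much as ZIP records an offset to its central directory). The directory cannot be interpreted until the offset has been read and followed, which is a seek, not a left-to-right parse step. The toy format and names below are made up.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Read a little-endian DWORD at byte position pos.
uint32_t dword_le(const std::vector<uint8_t>& buf, std::size_t pos) {
    return buf[pos]
         | (uint32_t(buf[pos + 1]) << 8)
         | (uint32_t(buf[pos + 2]) << 16)
         | (uint32_t(buf[pos + 3]) << 24);
}

int main() {
    // A toy "file": bytes 0..3 hold the offset (8) of a one-byte directory.
    std::vector<uint8_t> file = {8, 0, 0, 0, 0xAA, 0xBB, 0xCC, 0xDD, 42};
    uint32_t dir_offset = dword_le(file, 0);      // read left to right...
    uint8_t directory = file[dir_offset];         // ...then jump to interpret the rest
    std::printf("directory byte at offset %u is %u\n", dir_offset, unsigned(directory));
}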

Appendix

Why not ASN.1? Because it’s not suitable. It’s complex, and yet incomplete. It separates abstract notation from encoding, which we don’t want (we want to keep them together), and it can’t handle random-access formats.

Why not BFF (“A Grammar for Binary File Formats”)? It limits itself to a bytestring, not a bitstring.

Something intriguing: Synalyze It! (sigh, people who put exclamation marks in names). Looking at its “grammar” for ZIP files, it falls back on a fair amount of Lua. It’s really a decompiler, not a grammar; I don’t see how it could be used to generate strings (e.g. legal ZIP files). But still, it’s interesting.

Another interesting commercial candidate is 010 Editor, with the brag “Edit Anything”. Its approach is interesting: C structs with embedded semantic actions. Again, it’s not a grammar and not generative; it’s really a specialized language for parsing binary data.

Links

Parsing of Context Sensitive Grammars

Earley Parsing for Context-Sensitive Grammars

Sample Parsers from Parsing Techniques

The Convergence of Mildly Context-Sensitive Grammar Formalisms

Genetic programming with context-sensitive grammars

A new method for dependent parsing (Yakker)

Van Wijngaarden grammar

xtc (eXTensible Compiler) project.

PacketTypes: Abstract Specification of Network Protocol Messages