Category Archives: Parsing

Grammars for binary files

Another name for this in the literature is “data description languages”, and another search term is “dependent parsing”. There’s not much out there that I truly like.

My example targets are things like ICC Profile, JPEG (really JFIF/EXIF), ELF, PEFF, and the like. The idea is that a precise and readable description of a file format is desirable, not just to facilitate robust ways of working with files, but to open up working with file formats to arbitrary uses on any platform. Also, most file formats, even those with verbose documentation, are written with enough ambiguity that the resulting implementations are buggy or not robust. Complications in many file formats include things like offsets into pools: are all the offsets valid, do items overlap, is there garbage in the pool?
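As a concrete (and entirely made-up) illustration of the kinds of checks a grammar for binary files ought to let you state declaratively, here is a minimal Python sketch for a hypothetical format with a count, a pool size, a table of (offset, length) entries, and a trailing byte pool:

# Toy illustration (not any real format): a little-endian header of
# uint32 count and uint32 pool_size, followed by `count` pairs of
# uint32 (offset, length) that point into a trailing byte pool.
import struct

def check_pool(data: bytes):
    count, pool_size = struct.unpack_from('<II', data, 0)
    entries = [struct.unpack_from('<II', data, 8 + 8 * i) for i in range(count)]
    pool_start = 8 + 8 * count
    if len(data) < pool_start + pool_size:
        raise ValueError('file too short for declared pool')
    covered = [False] * pool_size
    for off, length in entries:
        # Are all the offsets good?
        if off + length > pool_size:
            raise ValueError('entry runs past the end of the pool')
        # Do we have overlap in items?
        if any(covered[off:off + length]):
            raise ValueError('entries overlap in the pool')
        for i in range(off, off + length):
            covered[i] = True
    # Is there garbage in the pool (bytes no entry refers to)?
    return pool_size - sum(covered)

data = struct.pack('<II', 1, 5) + struct.pack('<II', 0, 5) + b'hello'
assert check_pool(data) == 0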

And specifically, I want something that can be used to write composable grammars – take two or more separate grammars and squash them together into a single grammar.

RFC 31 – Binary Message Forms in Computer Networks is too simplistic; it's not really meant as something to specify grammars in.

There is BFF: A grammar for Binary File Formats, but while it moves a little in the right direction, it’s too focused on simple binary file formats. It does allow for constructing values as little-endian or big-endian.

A commercial product named 010 Editor has a template format that’s also another step, but again it’s more about mapping structure definitions to binary data than in being a way to specify grammars.

The WARC File Format is specified in ABNF, but it’s not rigorous enough to feed into a parser generator on its own.

DataScript – a Specification and Scripting Language for Binary Data is a better attempt at grammars for binary files. It's still too tied to the idea of structures, but it does allow for bitfields, size and byte order, length fields, and pointers as offsets. Paradoxically, there is too much built in for me to feel it can be general-purpose. Also, it's not really building parsers so much as doing constraint satisfaction.

PacketTypes is an older, more limited system in the same vein. However, it has the right idea of starting with a single primitive type, the bit.

PADS: A domain-specific language for processing ad hoc data. PADS is an acronym for Processing Ad Hoc Data Sources. It has a slight bias for presenting data as if it were XML, and it uses a recursive-descent parser. There is a home page for the project at http://padsproj.org.

David Eger’s Bit Level Types is slightly more interesting, in that the tack is “concisely describing the layout of a data structure as a string of bits”. It's still not a grammar, though, but a layout specification. And as such it is narrow: it has the limitations of LL(k), in that the data must be parseable predictively with a few bits of lookahead.

An older paper might have the right approach: Efficient Demultiplexing of Network Packets by Automatic Parsing. It does LR parsing, which could be extended to GLR. It talks about combining filters into a composite specification, which is akin to what I want to do. It's also interesting in that it has the ideas of skipping states and chaining states (skipping bits that are not important, and chaining multiple bits to recognize them as one state).

Another old paper, Adaptive Pattern Matching on Binary Data, discusses using Erlang to parse and process binary data.

The Fisher, Mandelbaum, and Walker paper The Next 700 Data Description Languages tries to add rigor.

A new method for dependent parsing introduces the idea of implementing dependent parsing by extending existing parsing algorithms. Dependent grammars were originally introduced to more cleanly specify context-sensitive syntax. This is coming at the problem from a traditional parsing approach, but I think it could be extended easily to parsing typical binary file formats.

The answer is probably to do what YAKKER does, and that means I need to learn OCaml. Many binary formats need variable binding (e.g. length fields or offsets) as well as data-dependent constraints. They call this attribute-directed parsing, since constraints are evaluated during parsing, not afterwards.
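To make the idea concrete, here is a minimal hand-written sketch (in Python, and not Yakker's actual notation or API) of what a dependent rule does: bind a value read from the input — here a one-byte length field — and use it to constrain the rest of the parse, with the constraint checked during parsing:

# Sketch of the dependent-parsing idea: a rule binds a value from the
# input (a one-byte length) and a later part of the rule is constrained
# by that value, checked as the parse proceeds rather than afterwards.
def parse_record(data: bytes, pos: int = 0):
    n = data[pos]                      # bind: n = length field
    pos += 1
    if pos + n > len(data):            # constraint uses the bound value
        raise ValueError('declared length exceeds available input')
    payload = data[pos:pos + n]        # payload is exactly n bytes
    return payload, pos + n

payload, end = parse_record(b'\x05hello world')
assert payload == b'hello'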

LL Parsing

Again, write some stuff.

LL parsing’s appeal is largely that you can hand-write an LL parser, even for a complex language like C++ (it’s definitely not practical, and maybe not even possible, to hand-write an LR parser for most languages). Now, even though it’s intuitive, the intuition is, I think, wrong, because the parsing algorithm in your brain (for reading English, for example) is almost certainly not top-down (insert references to recent experiments). But it certainly is simpler, and one reason it’s simpler is that LL grammars are a subset of LR grammars (which are in turn a subset of all context-free grammars).

More specifically, if you have an EBNF grammar, you can almost write the LL parser from the grammar just by turning non-terminals into function calls and terminals into matches against the input.
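For example, given a tiny hypothetical EBNF grammar for sums with parentheses, the parser falls out almost mechanically (a Python sketch, not any particular generator's output):

# One function per non-terminal, one match per terminal -- a direct
# transcription of a tiny made-up EBNF grammar:
#   expr ::= term { "+" term }
#   term ::= NUMBER | "(" expr ")"
import re

def tokenize(s):
    return re.findall(r'\d+|[+()]', s)

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def match(self, t):                   # terminal: match against the input
        if self.peek() != t:
            raise SyntaxError(f'expected {t!r}, got {self.peek()!r}')
        self.i += 1

    def expr(self):                       # non-terminal: a function
        value = self.term()
        while self.peek() == '+':         # EBNF repetition becomes a loop
            self.match('+')
            value += self.term()
        return value

    def term(self):
        if self.peek() == '(':
            self.match('(')
            value = self.expr()
            self.match(')')
            return value
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f'expected a number, got {tok!r}')
        self.i += 1
        return int(tok)

assert Parser(tokenize('1+(2+3)')).expr() == 6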

Another appeal of LL parsing is that it’s easier to produce useful compiler errors and do error recovery. Then again, that may just be due to not fully understanding how LR parsing handles errors. Some of the preference is probably also due to the bias against ambiguous grammars (mathematicians dislike ambiguity).

Fischer and LeBlanc’s “Crafting a Compiler” has a good summary of LL versus LR.

http://cs.stackexchange.com/questions/43/language-theoretic-comparison-of-ll-and-lr-grammars has some notes about theoretical bounds of complexity for LL and LR grammars. LL is a subset of LR(1).

blacc – the Burk Labs Anachronistic Compiler Compiler – is “yet another parser generator”. This one is pretty small, has some good ideas, and is LL(1) with extensions (e.g. it does left-corner transformations so that left-recursion handling is algorithmic instead of ad hoc).

SID (from the TenDRA project) is another LL(1) parser generator.

Of course, the most famous LL parser generator is ANTLR (originally PCCTS), which is on its fourth major redesign. And Terence Parr is a vociferous defender of LL parsing: http://compilers.iecc.com/comparch/article/92-05-059 and http://www.antlr.org/article/needlook.html.

GLL Parsing

GLL Parsing is Generalized LL Parsing: a generalized parsing technique built on LL parsing (left-to-right input, leftmost derivation). LL has several advantages over LR (it’s easier to understand, a parser can be written by hand, and it can be implemented with recursive descent), but on its own LL is fairly limited; in particular, it doesn’t allow left recursion (see Limitations of LL(1) Parsing). GLL removes that restriction and handles left-recursive grammars.
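The left-recursion limitation is easy to see in a hand-written sketch (Python, toy grammar of my own): a direct transcription of a left-recursive rule calls itself before consuming any input, so the usual workaround is to rewrite it as iteration or right recursion — exactly the transformation GLL makes unnecessary:

# Why plain recursive descent cannot handle left recursion: for the rule
#   expr ::= expr "-" NUMBER | NUMBER
# a direct transcription recurses before consuming anything.
def expr_left_recursive(toks, i=0):
    value, i = expr_left_recursive(toks, i)   # infinite recursion: no input consumed yet
    ...

def expr_iterative(toks, i=0):
    # equivalent rewrite: expr ::= NUMBER { "-" NUMBER }
    value = int(toks[i]); i += 1
    while i < len(toks) and toks[i] == '-':
        value -= int(toks[i + 1])
        i += 2
    return value, i

assert expr_iterative(['7', '-', '2', '-', '1']) == (4, 5)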

What looks like a fundamental paper came out in 2010 from Elizabeth Scott and Adrian Johnstone: GLL Parsing.

Recursive Descent (RD) parsers are popular because their control flow follows the structure of the grammar and hence they are easy to write and to debug. However, the class of grammars which admit RD parsers is very limited. Backtracking techniques may be used to extend this class, but can have explosive runtimes and cannot deal with grammars with left recursion. Tomita-style RNGLR parsers are fully general but are based on LR techniques and do not have the direct relationship with the grammar that an RD parser has. We develop the fully general GLL parsing technique which is recursive descent-like, and has the property that the parse follows closely the structure of the grammar rules, but uses RNGLR-like machinery to handle non determinism. The resulting recognisers run in worst-case cubic time and can be built even for left recursive grammars.

(The entire proceedings of the conference where this paper was introduced: Preliminary Proceedings of the Ninth Workshop on Language Descriptions, Tools and Applications (LDTA 2009), page 113.)

Another recent thesis uses GLL parsers to facilitate the design and development of domain-specific languages (DSLs).

Language embedding (extending a language by embedding another language within it) is one big arena for generalized parsing techniques.

Daniel Spiewak has started doing work with GLL parsers – see his series of articles.

Laurie Tratt has been writing about parsing lately. He wrote a much-quoted paper (“Parsing: The Solved Problem That Isn’t”, 2011) arguing that grammar composition is important and that ambiguity is important. He also wrote a paper showing how to do left recursion safely in PEG parsing, even though he seems to think that PEG is not the way to go.

Bas Basten wrote a paper on ambiguity detection methods for context-free grammars, which is useful once you start using generalized parsing techniques (since it is easier to end up with unwanted ambiguities in your grammar). LL and LR don’t tolerate ambiguity, which is both good and bad – it’s a straitjacket, but it’s convenient. There’s probably some argument that good grammars are ambiguous, but we only want certain kinds of ambiguity: the kinds whose added expressiveness outweighs the confusion they can introduce.

Parsing

GLR parsing

GLR is the way to go. Really.

Marpa

Active as of late 2012. http://jeffreykegler.github.com/Marpa/. Paper: http://cloud.github.com/downloads/jeffreykegler/Marpa-theory/recce.pdf. Google group: https://groups.google.com/forum/#!forum/marpa-parser

Accent

Last update in 2006? http://accent.compilertools.net/

Elkhound

Not really active any more.

The Elkhound home page, and a Google talk that Scott McPeak gave in 2006. Also a 2004 paper: http://www.cs.berkeley.edu/~necula/Papers/elkhound_cc04.pdf

DParser

http://dparser.sourceforge.net/

Scannerless GLR

http://oai.cwi.nl/oai/asset/12772/12772A.pdf

ASF+SDF -> Rascal

http://www.meta-environment.org/ and http://www.rascal-mpl.org/. See http://en.wikipedia.org/wiki/ASF%2BSDF_Meta_Environment too.

GLL Parsing

Maybe GLL is the way to go?

The GLL Parsing paper by Elizabeth Scott and Adrian Johnstone. Other papers.

Stratego/XT and Spoofax.

http://strategoxt.org/ and http://strategoxt.org/Spoofax.

Books

Dick Grune wrote a book on just parsing, named Parsing Techniques.

Articles

http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt

Packrat/PEG

Not sure that these are worth studying; I think they are a dead end. Yes, you can compose two PEG grammars together, but a PEG grammar is more limited than LR. This is the canonical page: http://bford.info/packrat/. ANTLR is probably the most prominent parser generator associated with this style, though it is LL-based with PEG-like syntactic predicates rather than a pure packrat/PEG implementation.
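For reference, the core of a packrat parser is small. Here is a minimal Python sketch (a toy grammar of my own, not taken from Ford's page) showing PEG ordered choice plus memoization of (rule, position) results, which is what gives the linear-time guarantee:

# Minimal packrat sketch for the toy grammar  S <- "a" S "b" / "ab".
# Ordered choice tries alternatives in order; memoizing each
# (rule, position) result is what makes the parse linear time.
from functools import lru_cache

def parse(text):
    @lru_cache(maxsize=None)           # the packrat memo table
    def S(pos):
        # first alternative: "a" S "b"
        if text.startswith('a', pos):
            end = S(pos + 1)
            if end is not None and text.startswith('b', end):
                return end + 1
        # ordered choice: fall through to the second alternative, "ab"
        if text.startswith('ab', pos):
            return pos + 2
        return None                    # failure
    return S(0) == len(text)

assert parse('aaabbb') and parse('ab') and not parse('aab')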

Other

http://dinosaur.compilertools.net/ – home page for Lex and Yacc.

Converge is interesting: http://convergepl.org/

Cling is an interactive C++ interpreter based on LLVM and clang.

Some useful notes on parsing and compilers from Lutz Hamel at University of Rhode Island: CSC 402 – Lecture Notes

Why I prefer LALR parsers – “[elimination of left-recursion] is achieved by converting left-recursion to right-recursion; which will always convert an S-attribute grammar into an L-attribute grammar; and seriously complicate any L-attribute grammar.”
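A small Python sketch (my own toy example, not from that article) of the complaint: once the left-recursive rule expr ::= expr "-" NUMBER | NUMBER is rewritten as right recursion, keeping left associativity means passing an accumulator down through the parse — the inherited, L-attributed value the quote objects to:

# With the left-recursive rule, the value is a simple synthesized
# attribute.  After rewriting to right recursion,
#   expr ::= NUMBER rest ;  rest ::= "-" NUMBER rest | empty
# a naive synthesized attribute would compute 1-(2-3), so to keep left
# associativity the accumulator must be threaded down (L-attributed).
def expr(toks, i=0):
    return rest(int(toks[i]), toks, i + 1)

def rest(acc, toks, i):                 # `acc` is the inherited attribute
    if i < len(toks) and toks[i] == '-':
        return rest(acc - int(toks[i + 1]), toks, i + 2)
    return acc, i

assert expr(['1', '-', '2', '-', '3']) == (-4, 5)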

http://code.google.com/p/blacc/wiki/DesignLeftRecursion