Making a grammar for TIFF, part 1

I’ve been trying to come up with actual grammars for arbitrary binary file formats. The traditional ways are either “ad hoc” or “map it to a struct”. Neither works out well in practice; that it works at all is more a testament to hard work and tolerance of quirks than it is to correctness and usability.

I’m going to develop a grammar for the TIFF file format. I’m going off of this version of TIFF: TIFF Revision 6.0 Final June 3, 1992. The full grammar I’m working on can be found in this other post: TIFF 6.0 binary file grammar.

TIFF file grammar round 1

Here’s the grammar at a high level

grammar tiff "TIFF 6.0 1992-06-03"

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data
data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes=4} offset {seek offset} {bytes _esize} .

Text in braces like {endian tiff.bom} are akin to semantic actions; they are not part of the syntax, they either have semantic meaning or are altering where or how data is parsed. We will think of them as semantic actions, because that’s how you’d do this kind of thing in a more classic parser. However, I’m going to define the syntax for semantic actions as opposed to leaving it as “write in some language like C”, because I want the semantic actions to be portable to any target, and my semantic actions have lots of magic in them.

Endianness

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

One valid way to look at binary data is as a string of bits. We could do this, but writing grammars like this would be impractically large. Instead, we want to work with data at a higher granularity, and that means we need to assign meaning, and the very first meaning has to do with how we arrange bits and bytes into values. Setting aside actual bit-level grammars for now and focusing on byte-level grammars, if we have a number larger than the range 0..255 (or -128..+127), we need to store it in more than one byte. That means that we need to set a byte order that is consistent between readers and writers of that number.

The TIFF format does not have a predetermined byte order. Instead, the byte order is recorded in an entry in the TIFF data header. In TIFF, the very first two bytes of the TIFF data must be

0x49 0x49

or

0x4d 0x4d

to mean little-endian or big-endian, respectively. I’m using the fairly common “0x” prefix to indicate values written in base 16, also known as hexadecimal, and I’m going to write one number for each 8-bit byte.

It turns out that in ASCII (and UTF-8, and Windows Latin-1, and many other character sets), 0x49 0x49 is “II”, and 0x4d 0x4d is “MM”. This was Adobe’s self-referential way of saying that “II” means that data is stored in the file in little-endian fashion, because the “I” refers to Intel, and that “MM” means big-endian, because “M” refers to Motorola. At the time of the spec, these were the two remaining giants who’d picked opposing endianness for their CPUs when processing data longer than 1 byte in length. For a brief moment in time, it looked like we could assume little-endian, but then ARM came along to challenge Intel, and now we have both again in credible amounts.

So there are two grammars for the rest of the file – one where all the multi-byte values are stored in little-endian fashion, and another where all the multi-byte values are stored in big-endian fashion. We have to either represent this directly in the grammar, or use some sort of short-cut. And, there are other file formats where the endianness can change within the file (for example, in EXIF, which used TIFF as a starting point, you can end up with a file where some data sections have little-endian data, and other data sections have big-endian data).

So I want to introduce a higher-level action that says “for this subset of the data” (and subset could be “entire file”), all multiple-byte values are big endian. My sketch above has some magic in it that I will write out more fully. Suffice it to say that {endian tiff.bom} says “set the endian of the data from this point in the sentence of the grammar to the endianness of the bom”.

Now, I need to set the endianness of the bom itself, and it might look something like this

tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

so if I match the left side of the alternation, I am saying tiff.bom has the little-endian property, and if I match the right side, I’m saying that tiff.bom has the big-endian property. This would be extracted as seen above. At this point, little and big are magic predefined entities, and it’s likely that I’ll want to call that out with some syntax different from the (user-defined) terminals and non-terminals.

Semantically, I want endianness to be “block-oriented”, so endianness is specific to the rule (and sub-rules) that it is set in. That’s why you seen {endian tiff.bom} repeated several times.

An aside on verbosity

It’s going to seem like this is unecessarily verbose. It’s the opposite, I expect to end up with a full TIFF grammar that’s a fraction of the size of the spec (dozens of the 121 pages are devoted to describing the TIFF format), and the grammar will be precise, whereas the spec is not. More importantly, if you wanted TIFF load/save code for a new processor or a new language, it’s going to take you days to read the spec and turn it into working code. The point of having a grammar is that you just run the grammar through a new back-end and get code for your new language or platform.

Or imagine doing this on the fly – having a grammar file that is loaded by a program, thus giving that program (within limits) the ability to manipulate that file. And maybe that’s the way revisions of file formats are handled.

Also, you can compose grammars, and that is a key to why I am doing this in the first place.

Size of data

In text-based grammars, we assume there is a logical/abstract “character” and we let the lower-level code worry about how that maps to a physical stream of bytes/bits. In our binary grammars, we have to be explicit.

Looking at the second line in the test grammar, we see a {bytes 2} semantic action. That means that the next item is a 2-byte value. So this

header = {bytes 2} tiff.bom {bytes=2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

matches a two-byte bom, and a two-byte explicit decimal 42 in the endianness determined by the bom. This is both a long-winded and yet short way of saying that TIFF data starts with one of these two 4-byte sequences

49 49 2a 00
4d 4d 00 2a

The grammar approach is more verbose, and yet it’s not. There’s no meaning attributed to listing out sequences of bytes, so I would still need some rule to say “the first line means that the rest of the file has little-endian data”. Secondly, I have lost the ability to extend. Third, I want to generate file parsers automatically from my grammar, for any machine and target language.

We have tiff.version as a non-terminal rather than as a hard-coded 42 for two reasons: one, it calls it out as a semantically meaningful entity, and so that some day, a TIFF revision 6.x (or 7) could use a higher version number to allow for creating TIFF images bigger than 4 GB. In fact, this has been proposed, and it’s called bigTIFF.

Also note that the {bytes 2} action for bom is not really necessary, because I explicitly wrote out the bom pattern as two bytes. This was a stylistic choice aiding the reader of the grammar. I have left out {bytes 1} markers on 0x49 and 0x4d, because currently we aren’t doing bit-level grammars. These would need to be introduced once that happens, or we need some grammar-level instruction that says “this grammar has byte alignment”.

As an aside, I do intend to handle even bit-level binary files, say for example ZIP or RAR containers. This can be done with a simple extension, {bits 6}for example. But that means I really need to figure out how to declare literals, since literal size needs to be explicit somehow. I want to write bit strings as well as hex strings.

Random access

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data

One big difference between this and a grammar for say C++ is that many binary file formats are random-access. You don’t read the file in sequential order starting with the first byte; instead, some bytes are addresses that point to other parts of the file. And at least equally important, the meaning of the file generally depends on the data in the file, so you can’t even parse it in a rudimentary fashion.

Take the common “catalog chunk” used in many binary file formats. The catalog chunk has information describing where data is, and also how to interpret it. You have to read and parse the catalog chunk first, even if the catalog is at the end of the file.

In the case of the TIFF format, there are start-relative pointers to data. Right after the four bytes of the bom and version, there is a 4-byte offset pointing to the start of the first IFD, or Image File Directory. TIFF data can have one or more IFDs in it, and each one describes an image. At the end of each IFD is an offset to the next IFD, and if the offset is 0, this is taken to be a terminator; no more IFDs are present in the file.

We can describe it as so

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0

There are zero or more pairs of offset and IFD, followed by a 4-byte 0. The layout is that there is a 4-byte offset that points to where the IFD is; in the line above, the offset and the IFD are not necessarily contiguous in memory. Instead, we introduce a semantic action to move the read cursor to offset (and later, we’ll talk about what this is an offset from).

Note that offset is a 4-byte integer stored with a specific endianness, but we decided that endianness is global for now, and we had set it after parsing the bom.

It’s possible that I’ll decide to introduce symbols for seek and endianness (and as we will get to later, we’ll see that endianness is simply a special case of the more encompassing idea of formatting data). I’ll stick with the function-name approach for now, as it is slightly more readable as we’re reasoning things out. The main thing to take away is that we have to be able to move about the data, either skipping forwards, or rewinding backwards.

Something that might stand out as a warning sign is “how do we know our data doesn’t have holes or overlaps?” This is a good question. The former just means waste, whereas the latter is almost certainly a problem. Now, aliasing is different, and that just means re-use, which isn’t a problem in and of itself. The answer here is that we have nothing for now; that’s a lint-like facility that we could use after parsing or generating data. Maybe there’s a way to bake it into the grammar itself; but if we did bake it into the grammar, we would reduce the universality of the grammar. Even without so reducing, the fact that we have a grammar means that we can have a single automated tool that can detect overlaps or holes for any grammar on any data.

There is a problem with the definition as written, and it is that offset cannot be zero, nor can it be odd – all data in a TIFF file is meant to be aligned on word (two-byte) bounds, probably as a nod to efficiency. We’ve been skating over the matching of the terminals; offset isn’t a terminal, it’s a non-terminal.

tiff = tiff.header tiff.data
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)

And this has reminded us of another data formatting issue other than endianness, and that is that of signed versus unsigned numbers. Binary file formats contain not just 1, 2, 4 or 8-byte numbers (or beyond); they contain signed numbers, unsigned numbers, floating point, rationals, enumerations, strings of many different character sets, arrays, and more. It’s not unheard of to have specially constructed data types that don’t correspond to standard things.

The above definition side-stepped the idea of unsigned. It simply stated that offset is a 4-byte number that is in the range 1 to 2^32-1. In practice, this exactly maps to the idea of unsigned. However, I imagine we will need something more explicit than that.

And we’ll get to alignment at some point.

Dynamic data

Now we get to the hard part, or at least slightly more complicated than we’ve been doing to date.

An IFD (Image File Directory) is a structure consisting of a two-byte count of the number of directory entries, and then sequence of that many directory entries, where each entry is a 12-byte composite of type, tag, count and value or offset to value.

ifd = {bytes 2} entries entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data

As you can see, I’m using a typical array notation to indicate “repeat N times”. As a simple definition, this covers the image directory itself, but doesn’t quite handle the values.

The value of elems is not the number of bytes in the value, it’s the number of elements of the basic type. The type field specifies the basic type, and the data is an array of elems items of that type. For example, if type was 3, this would be “unsigned short”, a 2-byte datum, and if elems was 4, this would mean 8 bytes of data, comprising a list of 4 unsigned shorts.

The second complication is that offset is only an offset for data that is longer than 4 bytes. Data 4 bytes or less in length is packed into the offset field, left-justified. It’s still subject to the endianness declared in the header, however.

TIFF 6.0 has 12 types; signed and unsigned 1-byte, 2-byte, and 4-byte integers, signed and unsigned 8-byte rational, ASCII zero-terminated string (actually can be multiple strings, each zero-terminated; 4-byte float, 8-byte float, and a “undefined” which is really 8-byte octets that have no predetermined data type. The pedagogy of the latter is interesting, since in practice there is no difference between the “octet” type (a better name than undefined) and the unsigned 1-byte integer.

So an individual element’s size is determined by the type field, in a somewhat arbitrary way. Maybe we could express it like this

type = (type_1B | type_2B | type_4B | type_8B | type_U)
type_1B = (1 | 2 | 6 | 7) {_esize=1}
type_2B = (3 | 8) {_esize=2}
type_4B = (4 | 9 | 11) {_esize=4}
type_8B = (5 | 10 | 12) {_esize=8}
type_U = (0|13..|65535) {_esize=0}

We have a type_U because the TIFF spec states “readers should skip over fields containing an unexpected field type”. Alternatively, we could declare this to be an error. If I had my druthers, I would declare it an error, but write my parser to recover from errors.

Then we could use our semantic _esize value to calculate the size of the data, and then use that when matching to decide if we have embedded data or an offset to the data.

data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

Our data is either embedded in the IFD entry, or pointed to by the offset contained in the entry. I need some sort of predicate to say “match here if the computed size is 4 bytes or less”, and in reality I’ll use something from one of the functional languages. I use “.” here to mean “match anything”, but it’s inconspicuous enough that I’ll probably switch to something else.

There is a problem here, and that is the issue of random access and the file cursor. I haven’t said anything about localizing the cursor, but I am moving it in one case and not the other. In either event, I need the cursor change to be temporary when it fetches the data from the offset, because the cursor needs to be left pointing to the start of the next record. I think it’s enough to say that cursor seeks are localized to the rule they are operating in, so at the end of the rule, the cursor is restore to where it would be if the seek hadn’t happened. I’ll need to write out some cursor actions to make sure that is the case.

Alternatively, we could introduce push and pop operations, although I’m less thrilled about that.

Enumerations

We’re almost done, but the last part is a lot of typing. The tag field in the IFD entry is the field’s name, which is also its meaning. There are 74 defined tags in the TIFF file format. While each field has a low-level type format, that’s only for storage; each tag is free to have a higher-level meaning for its data. For example, ImageWidth and ImageHeight store the width and height of the image in pixels, and you can’t process the TIFF file without reading and understanding these. Compression is actually an enum of 7 different compression types.

In principle, it’s straightforward

tag = (NewSubfileType | SubfileType | .. | Copyright)
NewSubfileType = 254 { stuff here }
SubfileType = 255 { stuff here }
Copyright = 33432 { stuff here }

You would do something with the data at this point, if you were generating a parser from the grammar. For TIFF data, you’d find the RowsPerStrip, StripOffsets and StripByteCounts fields, which point to the actual compressed data for strips of pixels, along with other important tags to help you render the data properly.

This is understating this section, because the TIFF file format subsumes the format of TIFF images, which needs to be described. However, there are no more new elements needed to describe TIFF data, once you have predicates, sizes and endianness.

There are several thousand EXIF tags, by the way. Still, it’s just typing, and it’s far better to have everything explicit, and documented.

Constraints

There are some constraints we haven’t put in yet. The entries in an IFD are meant to be sorted by tag number, in ascending order. Presumably the thought here was for binary search, although the number of entries is rarely enough that binary search would make sense.

Another constraint is that you can only have unique tags in an IFD. I don’t see this written into the spec anywhere, but I’m pretty sure it would have to be this way, or you would not be able to read or parse the entire file.

Another constraint is that valid files require certain tags.

Another constraint is that, if there are multiple images in the TIFF, the first one must be the high-resolution image. Subsequent ones are generally reduced-resolution, say for thumbnails.

The TIFF spec disallows aliasing of data; all data can be referenced from one place only.

Partial file parsing

After all of this, I want to return to one of the implications of why binary files formats usually use random-access; many binary files are designed so that you don’t need to read the entire file. For example, if you have a texture stored as a set of mipmaps, you only want one of the mimaps; reading (or parsing) the other ones would be a waste of time.

Having a recognizer generated from a grammar does not mean that the recognizer has to parse everything. The onus on us, however, is to make sure that the grammars are written in a way that recognizers can take advantage of in skipping over data that is not needed.

 

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>