TIFF 6.0 binary file grammar

This is the entire reference grammar, referred to by other posts. Or rather, this will be the entire grammar, I will keep updating it until it is complete.

grammar tiff "TIFF 6.0 1992-06-03"

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^^32-1)

ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data

type = (type_1B | type_2B | type_4B | type_8B | type_U)
type_1B = (1 | 2 | 6 | 7) {_esize=1}
type_2B = (3 | 8) {_esize=2}
type_4B = (4 | 9 | 11) {_esize=4}
type_8B = (5 | 10 | 12) {_esize=8}
type_U = (0|13|..|65535) {_esize=0}

data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

tag = (Artist | BitsPerSample | CellLength | CellWidth
  | ColorMap | Compression | Copyright | DateTime | DocumentName
  | DotRange | ExtraSamples | FillOrder | FreeByteCounts
  | FreeOffsets | GrayResponseCurve | GrayResponseUnit
  | HalftoneHints | HostComputer | ImageDescription | ImageLength
  | ImageWidth | InkNames | InkSet | JPEGACTables | JPEGDCTables
  | JPEGInterchangeFormat | JPEGInterchangeFormatLength
  | JPEGLosslessPredictors | JPEGPointTransforms | JPEGProc
  | JPEGQTables | JPEGRestartInterval | JbCbCrSubSampling | Make
  | MaxSampleValue | MinSampleValue | Model | NewSubfileType
  | NumberOfInks | Orientation | PageName | PageNumber
  | PhotometricInterpretation | PlanarConfiguration | Predictor
  | PrimaryChromaticities | ReferenceBlackWhite | ResolutionUnit
  | RowsPerStrip | SMaxSampleValue | SMinSampleValue
  | SampleFormat | SamplesPerPixel | Software | StripByteCounts
  | StripOffsets | SubfileType | T4Options | T6Options
  | TargetPrinter | Thresholding | TileByteCounts | TileLength
  | TileOffsets | TileWidth | TransferFunction | TransferRange
  | WhitePoint | XPositiohn | XResolution | YPosition
  | YResolution | YbCbCrCoefficients | YbCbCrPositioning)

NewSubfileType = 254
SubfileType = 255
ImageWidth = 256
ImageLength = 257
BitsPerSample = 258
Compression = 259
Compression.Uncompressed = 1
Compression.CCITT_1D  = 2
Compression.Group_3_Fax = 3
Compression.Group_4_Fax = 4
Compression.LZW = 5
Compression.JPEG = 6
Compression.PackBits = 32773
PhotometricInterpretation = 262
PhotometricInterpretation.WhiteIsZero = 0
PhotometricInterpretation.BlackIsZero = 1
PhotometricInterpretation.RGB = 2
PhotometricInterpretation.RGB_Palette = 3
PhotometricInterpretation.Transparency_mask = 4
PhotometricInterpretation.CMYK = 5
PhotometricInterpretation.YCbCr = 6
PhotometricInterpretation.CIELab = 8
Thresholding = 263
CellWidth = 264
CellLength = 265
FillOrder = 266
DocumentName = 269
ImageDescription = 270
Make = 271
Model = 272
StripOffsets = 273
Orientation = 274
SamplesPerPixel = 277
RowsPerStrip = 278
StripByteCounts = 279
MinSampleValue = 280
MaxSampleValue = 281
XResolution = 282
YResolution = 283
PlanarConfiguration = 284
PageName = 285
XPositiohn = 286
YPosition = 287
FreeOffsets = 288
FreeByteCounts = 289
GrayResponseUnit = 290
GrayResponseCurve = 291
T4Options = 292
T6Options = 293
ResolutionUnit = 296
PageNumber = 297
TransferFunction = 301
Software = 305
DateTime = 306
Artist = 315
HostComputer = 316
Predictor = 317
WhitePoint = 318
PrimaryChromaticities = 319
ColorMap = 320
HalftoneHints = 321
TileWidth = 322
TileLength = 323
TileOffsets = 324
TileByteCounts = 325
InkSet = 332
InkNames = 333
NumberOfInks = 334
DotRange = 336
TargetPrinter = 337
ExtraSamples = 338
SampleFormat = 339
SMinSampleValue = 340
SMaxSampleValue = 341
TransferRange = 342
JPEGProc = 512
JPEGInterchangeFormat = 513
JPEGInterchangeFormatLength = 514
JPEGRestartInterval = 515
JPEGLosslessPredictors = 517
JPEGPointTransforms = 518
JPEGQTables = 519
JPEGDCTables = 520
JPEGACTables = 521
YbCbCrCoefficients = 529
JbCbCrSubSampling = 530
YbCbCrPositioning = 531
ReferenceBlackWhite = 532
Copyright = 33432

Making a grammar for TIFF, part 1

I’ve been trying to come up with actual grammars for arbitrary binary file formats. The traditional ways are either “ad hoc” or “map it to a struct”. Neither works out well in practice; that it works at all is more a testament to hard work and tolerance of quirks than it is to correctness and usability.

I’m going to develop a grammar for the TIFF file format. I’m going off of this version of TIFF: TIFF Revision 6.0 Final June 3, 1992. The full grammar I’m working on can be found in this other post: TIFF 6.0 binary file grammar.

TIFF file grammar round 1

Here’s the grammar at a high level

grammar tiff "TIFF 6.0 1992-06-03"

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data
data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes=4} offset {seek offset} {bytes _esize} .

Text in braces like {endian tiff.bom} are akin to semantic actions; they are not part of the syntax, they either have semantic meaning or are altering where or how data is parsed. We will think of them as semantic actions, because that’s how you’d do this kind of thing in a more classic parser. However, I’m going to define the syntax for semantic actions as opposed to leaving it as “write in some language like C”, because I want the semantic actions to be portable to any target, and my semantic actions have lots of magic in them.

Endianness

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

One valid way to look at binary data is as a string of bits. We could do this, but writing grammars like this would be impractically large. Instead, we want to work with data at a higher granularity, and that means we need to assign meaning, and the very first meaning has to do with how we arrange bits and bytes into values. Setting aside actual bit-level grammars for now and focusing on byte-level grammars, if we have a number larger than the range 0..255 (or -128..+127), we need to store it in more than one byte. That means that we need to set a byte order that is consistent between readers and writers of that number.

The TIFF format does not have a predetermined byte order. Instead, the byte order is recorded in an entry in the TIFF data header. In TIFF, the very first two bytes of the TIFF data must be

0x49 0x49

or

0x4d 0x4d

to mean little-endian or big-endian, respectively. I’m using the fairly common “0x” prefix to indicate values written in base 16, also known as hexadecimal, and I’m going to write one number for each 8-bit byte.

It turns out that in ASCII (and UTF-8, and Windows Latin-1, and many other character sets), 0x49 0x49 is “II”, and 0x4d 0x4d is “MM”. This was Adobe’s self-referential way of saying that “II” means that data is stored in the file in little-endian fashion, because the “I” refers to Intel, and that “MM” means big-endian, because “M” refers to Motorola. At the time of the spec, these were the two remaining giants who’d picked opposing endianness for their CPUs when processing data longer than 1 byte in length. For a brief moment in time, it looked like we could assume little-endian, but then ARM came along to challenge Intel, and now we have both again in credible amounts.

So there are two grammars for the rest of the file – one where all the multi-byte values are stored in little-endian fashion, and another where all the multi-byte values are stored in big-endian fashion. We have to either represent this directly in the grammar, or use some sort of short-cut. And, there are other file formats where the endianness can change within the file (for example, in EXIF, which used TIFF as a starting point, you can end up with a file where some data sections have little-endian data, and other data sections have big-endian data).

So I want to introduce a higher-level action that says “for this subset of the data” (and subset could be “entire file”), all multiple-byte values are big endian. My sketch above has some magic in it that I will write out more fully. Suffice it to say that {endian tiff.bom} says “set the endian of the data from this point in the sentence of the grammar to the endianness of the bom”.

Now, I need to set the endianness of the bom itself, and it might look something like this

tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

so if I match the left side of the alternation, I am saying tiff.bom has the little-endian property, and if I match the right side, I’m saying that tiff.bom has the big-endian property. This would be extracted as seen above. At this point, little and big are magic predefined entities, and it’s likely that I’ll want to call that out with some syntax different from the (user-defined) terminals and non-terminals.

Semantically, I want endianness to be “block-oriented”, so endianness is specific to the rule (and sub-rules) that it is set in. That’s why you seen {endian tiff.bom} repeated several times.

An aside on verbosity

It’s going to seem like this is unecessarily verbose. It’s the opposite, I expect to end up with a full TIFF grammar that’s a fraction of the size of the spec (dozens of the 121 pages are devoted to describing the TIFF format), and the grammar will be precise, whereas the spec is not. More importantly, if you wanted TIFF load/save code for a new processor or a new language, it’s going to take you days to read the spec and turn it into working code. The point of having a grammar is that you just run the grammar through a new back-end and get code for your new language or platform.

Or imagine doing this on the fly – having a grammar file that is loaded by a program, thus giving that program (within limits) the ability to manipulate that file. And maybe that’s the way revisions of file formats are handled.

Also, you can compose grammars, and that is a key to why I am doing this in the first place.

Size of data

In text-based grammars, we assume there is a logical/abstract “character” and we let the lower-level code worry about how that maps to a physical stream of bytes/bits. In our binary grammars, we have to be explicit.

Looking at the second line in the test grammar, we see a {bytes 2} semantic action. That means that the next item is a 2-byte value. So this

header = {bytes 2} tiff.bom {bytes=2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

matches a two-byte bom, and a two-byte explicit decimal 42 in the endianness determined by the bom. This is both a long-winded and yet short way of saying that TIFF data starts with one of these two 4-byte sequences

49 49 2a 00
4d 4d 00 2a

The grammar approach is more verbose, and yet it’s not. There’s no meaning attributed to listing out sequences of bytes, so I would still need some rule to say “the first line means that the rest of the file has little-endian data”. Secondly, I have lost the ability to extend. Third, I want to generate file parsers automatically from my grammar, for any machine and target language.

We have tiff.version as a non-terminal rather than as a hard-coded 42 for two reasons: one, it calls it out as a semantically meaningful entity, and so that some day, a TIFF revision 6.x (or 7) could use a higher version number to allow for creating TIFF images bigger than 4 GB. In fact, this has been proposed, and it’s called bigTIFF.

Also note that the {bytes 2} action for bom is not really necessary, because I explicitly wrote out the bom pattern as two bytes. This was a stylistic choice aiding the reader of the grammar. I have left out {bytes 1} markers on 0x49 and 0x4d, because currently we aren’t doing bit-level grammars. These would need to be introduced once that happens, or we need some grammar-level instruction that says “this grammar has byte alignment”.

As an aside, I do intend to handle even bit-level binary files, say for example ZIP or RAR containers. This can be done with a simple extension, {bits 6}for example. But that means I really need to figure out how to declare literals, since literal size needs to be explicit somehow. I want to write bit strings as well as hex strings.

Random access

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data

One big difference between this and a grammar for say C++ is that many binary file formats are random-access. You don’t read the file in sequential order starting with the first byte; instead, some bytes are addresses that point to other parts of the file. And at least equally important, the meaning of the file generally depends on the data in the file, so you can’t even parse it in a rudimentary fashion.

Take the common “catalog chunk” used in many binary file formats. The catalog chunk has information describing where data is, and also how to interpret it. You have to read and parse the catalog chunk first, even if the catalog is at the end of the file.

In the case of the TIFF format, there are start-relative pointers to data. Right after the four bytes of the bom and version, there is a 4-byte offset pointing to the start of the first IFD, or Image File Directory. TIFF data can have one or more IFDs in it, and each one describes an image. At the end of each IFD is an offset to the next IFD, and if the offset is 0, this is taken to be a terminator; no more IFDs are present in the file.

We can describe it as so

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0

There are zero or more pairs of offset and IFD, followed by a 4-byte 0. The layout is that there is a 4-byte offset that points to where the IFD is; in the line above, the offset and the IFD are not necessarily contiguous in memory. Instead, we introduce a semantic action to move the read cursor to offset (and later, we’ll talk about what this is an offset from).

Note that offset is a 4-byte integer stored with a specific endianness, but we decided that endianness is global for now, and we had set it after parsing the bom.

It’s possible that I’ll decide to introduce symbols for seek and endianness (and as we will get to later, we’ll see that endianness is simply a special case of the more encompassing idea of formatting data). I’ll stick with the function-name approach for now, as it is slightly more readable as we’re reasoning things out. The main thing to take away is that we have to be able to move about the data, either skipping forwards, or rewinding backwards.

Something that might stand out as a warning sign is “how do we know our data doesn’t have holes or overlaps?” This is a good question. The former just means waste, whereas the latter is almost certainly a problem. Now, aliasing is different, and that just means re-use, which isn’t a problem in and of itself. The answer here is that we have nothing for now; that’s a lint-like facility that we could use after parsing or generating data. Maybe there’s a way to bake it into the grammar itself; but if we did bake it into the grammar, we would reduce the universality of the grammar. Even without so reducing, the fact that we have a grammar means that we can have a single automated tool that can detect overlaps or holes for any grammar on any data.

There is a problem with the definition as written, and it is that offset cannot be zero, nor can it be odd – all data in a TIFF file is meant to be aligned on word (two-byte) bounds, probably as a nod to efficiency. We’ve been skating over the matching of the terminals; offset isn’t a terminal, it’s a non-terminal.

tiff = tiff.header tiff.data
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)

And this has reminded us of another data formatting issue other than endianness, and that is that of signed versus unsigned numbers. Binary file formats contain not just 1, 2, 4 or 8-byte numbers (or beyond); they contain signed numbers, unsigned numbers, floating point, rationals, enumerations, strings of many different character sets, arrays, and more. It’s not unheard of to have specially constructed data types that don’t correspond to standard things.

The above definition side-stepped the idea of unsigned. It simply stated that offset is a 4-byte number that is in the range 1 to 2^32-1. In practice, this exactly maps to the idea of unsigned. However, I imagine we will need something more explicit than that.

And we’ll get to alignment at some point.

Dynamic data

Now we get to the hard part, or at least slightly more complicated than we’ve been doing to date.

An IFD (Image File Directory) is a structure consisting of a two-byte count of the number of directory entries, and then sequence of that many directory entries, where each entry is a 12-byte composite of type, tag, count and value or offset to value.

ifd = {bytes 2} entries entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 2} elems data

As you can see, I’m using a typical array notation to indicate “repeat N times”. As a simple definition, this covers the image directory itself, but doesn’t quite handle the values.

The value of elems is not the number of bytes in the value, it’s the number of elements of the basic type. The type field specifies the basic type, and the data is an array of elems items of that type. For example, if type was 3, this would be “unsigned short”, a 2-byte datum, and if elems was 4, this would mean 8 bytes of data, comprising a list of 4 unsigned shorts.

The second complication is that offset is only an offset for data that is longer than 4 bytes. Data 4 bytes or less in length is packed into the offset field, left-justified. It’s still subject to the endianness declared in the header, however.

TIFF 6.0 has 12 types; signed and unsigned 1-byte, 2-byte, and 4-byte integers, signed and unsigned 8-byte rational, ASCII zero-terminated string (actually can be multiple strings, each zero-terminated; 4-byte float, 8-byte float, and a “undefined” which is really 8-byte octets that have no predetermined data type. The pedagogy of the latter is interesting, since in practice there is no difference between the “octet” type (a better name than undefined) and the unsigned 1-byte integer.

So an individual element’s size is determined by the type field, in a somewhat arbitrary way. Maybe we could express it like this

type = (type_1B | type_2B | type_4B | type_8B | type_U)
type_1B = (1 | 2 | 6 | 7) {_esize=1}
type_2B = (3 | 8) {_esize=2}
type_4B = (4 | 9 | 11) {_esize=4}
type_8B = (5 | 10 | 12) {_esize=8}
type_U = (0|13..|65535) {_esize=0}

We have a type_U because the TIFF spec states “readers should skip over fields containing an unexpected field type”. Alternatively, we could declare this to be an error. If I had my druthers, I would declare it an error, but write my parser to recover from errors.

Then we could use our semantic _esize value to calculate the size of the data, and then use that when matching to decide if we have embedded data or an offset to the data.

data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

Our data is either embedded in the IFD entry, or pointed to by the offset contained in the entry. I need some sort of predicate to say “match here if the computed size is 4 bytes or less”, and in reality I’ll use something from one of the functional languages. I use “.” here to mean “match anything”, but it’s inconspicuous enough that I’ll probably switch to something else.

There is a problem here, and that is the issue of random access and the file cursor. I haven’t said anything about localizing the cursor, but I am moving it in one case and not the other. In either event, I need the cursor change to be temporary when it fetches the data from the offset, because the cursor needs to be left pointing to the start of the next record. I think it’s enough to say that cursor seeks are localized to the rule they are operating in, so at the end of the rule, the cursor is restore to where it would be if the seek hadn’t happened. I’ll need to write out some cursor actions to make sure that is the case.

Alternatively, we could introduce push and pop operations, although I’m less thrilled about that.

Enumerations

We’re almost done, but the last part is a lot of typing. The tag field in the IFD entry is the field’s name, which is also its meaning. There are 74 defined tags in the TIFF file format. While each field has a low-level type format, that’s only for storage; each tag is free to have a higher-level meaning for its data. For example, ImageWidth and ImageHeight store the width and height of the image in pixels, and you can’t process the TIFF file without reading and understanding these. Compression is actually an enum of 7 different compression types.

In principle, it’s straightforward

tag = (NewSubfileType | SubfileType | .. | Copyright)
NewSubfileType = 254 { stuff here }
SubfileType = 255 { stuff here }
Copyright = 33432 { stuff here }

You would do something with the data at this point, if you were generating a parser from the grammar. For TIFF data, you’d find the RowsPerStrip, StripOffsets and StripByteCounts fields, which point to the actual compressed data for strips of pixels, along with other important tags to help you render the data properly.

This is understating this section, because the TIFF file format subsumes the format of TIFF images, which needs to be described. However, there are no more new elements needed to describe TIFF data, once you have predicates, sizes and endianness.

There are several thousand EXIF tags, by the way. Still, it’s just typing, and it’s far better to have everything explicit, and documented.

Constraints

There are some constraints we haven’t put in yet. The entries in an IFD are meant to be sorted by tag number, in ascending order. Presumably the thought here was for binary search, although the number of entries is rarely enough that binary search would make sense.

Another constraint is that you can only have unique tags in an IFD. I don’t see this written into the spec anywhere, but I’m pretty sure it would have to be this way, or you would not be able to read or parse the entire file.

Another constraint is that valid files require certain tags.

Another constraint is that, if there are multiple images in the TIFF, the first one must be the high-resolution image. Subsequent ones are generally reduced-resolution, say for thumbnails.

The TIFF spec disallows aliasing of data; all data can be referenced from one place only.

Partial file parsing

After all of this, I want to return to one of the implications of why binary files formats usually use random-access; many binary files are designed so that you don’t need to read the entire file. For example, if you have a texture stored as a set of mipmaps, you only want one of the mimaps; reading (or parsing) the other ones would be a waste of time.

Having a recognizer generated from a grammar does not mean that the recognizer has to parse everything. The onus on us, however, is to make sure that the grammars are written in a way that recognizers can take advantage of in skipping over data that is not needed.

 

Schema evolution in data serialization schemes

This was an interesting article, well worth reading

Schema evolution in Avro, Protocol Buffers and Thrift

Protocol Buffers came from Google, and is very widely used.

Thrift is a project that started at Facebook but was turned over to Apache to run. The white paper gives a good introduction. Thrift is an entire RPC framework that includes data serialization.

Avro is another Apache project, but more narrowly focused on data serialization, and it relies on explicitly present schemas (the schema used for writing must be presented to the reader).

 

A litany of Exif woes

The Exif file format is, shall we say, not robust and not documented well. And yet it’s at the heart of hundreds of billions of data items – the photos that people take with digital cameras in ever-increasing numbers.

Strictly speaking, Exif data is not needed; it’s metadata about your image, not the image itself. And yet, sans that metadata, many photographers and their tools are worse off – the metadata stores timestamps, GPS locations, camera information, and more.

Where things go wrong is in importing or editing. For example, Apple’s iPhoto adds its own metadata when importing photos from cameras; so do most other importing tools. Photoshop needs to copy or remove metadata when you modify a photo. Social media sites are currently in a tiny bit of hot water for their active removing of metadata from photos uploaded to their sites.

Changing byte order

One of the worst things you can do is to change the byte-order in Exif metadata. Or at least, this is one that is fraught with peril, because there are a lot of specialized formats buried inside Exif, and if you fix up some but not all of the offsets, you’ll break metadata; not for you, because you’re skipping that bit for failing to understand it, but for some other tool more savvy than yours (or knowledgeable in a different way).

For example, I had taken some photos with a Nikon E990 in 2007. In looking at them recently (while trying to develop a more comprehensive JPEG file parser), I found that iTunes had decided to change an Exif chunk from little-endian to big-endian. It fixed up all the data it knew about, but evidently this older version of iTunes did not know about Nikon MakerData. And so it left the MakerData alone, which meant that its offsets were still little-endian. Photos I’d imported previously with Nikon’s custom software were fine, because Nikon of course knows how to handle its own metadata.

I don’t actually know if the Nikon E990 writes Exif as little-endian or big-endian, because I only have photos that were imported with a tool, not copied directly from its media. That is an interesting experiment I need to run at some point.

Maker Note – no specific format

The Maker Note data itself is very problematic, because there is no one format for it. About half the cameras in existence seem to write their Maker Note data in IFD format, and half have a custom format. This makes it very hard to do anything with the whole file, because MakerNote data offsets are absolute offsets, not relative to the MakerNote chunk, and there is actually nothing forcing Maker Note data to be stored inside the declared MakerNote chunk (although one suspects it usually is). So if you edit a Exif file in a way that causes the location of the MakerNote data to shift, this has the high likelihood of breaking the MakerNote data, since its embedded offsets are now pointing to the wrong place. Microsoft attempted to fix this, but their fix was useless.

What other people have said

Problems with current Metadata Standards. Phil Harvey is the author of ExifTool, the most comprehensive (if a bit too ad-hoc) Exif-type metadata tool around. Also, the ExifTool FAQ has tidbits of information in it about Exif-format issues.

MakerNote formats and specifications.

EXIF Maker Notes.

TIFF, Tag Image File Format, FAQ from Aware Systems has the defining line about the TIFF file format (which Exif piggy-backed onto).

Except for one more thing: the TIFF specification explicitly forbids tag data to include any offsets into the file or into any other TIFF data block, except for the documented special cases (TileOffsets, StripOffsets,…). This enables total relocatability of all structures and data blocks. That is a major corner stone of the format. It means it is actually easy to, e.g. unlink an IFD or concatenate multiple single-page TIFF to a single multi-page TIFF, or vice versa. If any tag’s data could contain offsets pointing anywhere in the file, then software doing this or otherwise relocating data blocks should be aware of the exact nature of every tag’s data in order to find all data blocks, and know what pointers ought to be changed. That is unfeasible, on the one hand, due to the number of defined tags and, on the other hand, it inhibits extendability and private tags.

 

Instead, the specification says that all tag data needs to be ‘self-contained’, and that only a selected few special tags are allowed to point to other locations in the file. Thus, all blocks become freely relocatable, can be read and written out in any order, and any software can quite simply joggle around all this TIFF data, with only inbuilt knowledge of these highest level structures, and of the selected few special tags.

If Exif had followed this, then we wouldn’t be having the problems currently faced. I presume that the original spec authors were thinking of Exif files as write-once magic that would not be edited (very short-sighted). On the other hand, TIFF files are a flat structure, basically an array of independent IFD elements. Sometimes structure is desired.

The (much later) suggestion was to have an IFD TIFF type that pointed to a self-contained IFD data block, which would basically be its own embedded TIFF file (with its own byte-orientation, and block-relative offsets).

The actual answer is a new file format that is backwards-compatible and forwards-compatible. This is a pretty daunting thing to imagine, but given the ever-increasing number of Exif files that are generated each year, something that we should do.

Nikon MakerNote Tags

 

Profiling Perl code with NYTProf

The absolute best way to profile Perl code is to use Devel::NYTProf.

Step 1. Install it, because it’s not part of the standard distribution (it should be). It’s called Devel:NYTProf, and is in ActiveState’s ppm as well as the more standard cpan.

Step 2. Use it

perl -d:NYTProf <command line to run your program>
nytprofhtml --open

This will give you a nice HTML-based profile of the execution of your program. Among the notable features are a treemap and a number of ways to drill down into the hotspots. It will even generate DOT files for rendering in Graphviz, although these are not always useful.

The perldoc on CPAN is here: Devel::NYTProf.

I’ll put a real profile up soon of one of my programs, but for now here’s an example of one that I found in a Google search: C:/lo/libo-master/solenv/bin/make_installer.pl

Here is a post by Tim Bunce on how the updated NYTProf came into existence: NYTProf v2 – the background story. And there’s a 2008 screen cast. And there is a blog posting on the latest version.