Category Archives: Parsing

Out-of-order parsing

More and more, I think parsing needs to be modernized. Parsing was the hot subject in the 1960s, and may very well have been synonymous with computer science up through the early 1970s, but has stayed in that form ever since.

We can do better, and we have to relax the restrictions imposed by thinking of parsed texts as strings.

Yes, Tomita was a big step (http://pdf.aminer.org/000/211/233/recognition_of_context_free_and_stack_languages.pdf), but that was 30 years ago (!!) and a pretty small step, in hindsight.

Visual Studio 2012 solution file grammar

Here is the start of a grammar for sln files in Visual Studio 2012. Since the solution file format hasn’t changed much since Visual Studio 2005, a little parameterization probably handles all versions, and I’ve attempted to do that here. Also, I’ve made non-terminals as short as possible without killing readability, in the interest of making something that’s legible in just a few columns. And I really need to respect the whitespace indentation rules – the parser in Visual Studio doesn’t enforce them on input, but does follow them on output, so I may as well bake them into the grammar somehow.

Also note that MSBuild has an internal parser for sln files that you can access with some work – see Getting Visual Studio version of a Solution file. I have no idea if there is a formal grammar with a parser, or if Microsoft only has ad-hoc parsers. I’m guessing the latter…
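
If all you need is the version, you don’t even need the grammar; here is a minimal sketch in Python (the function name is mine, nothing official) that pulls the Format Version out of the header, assuming the header follows the ver-header rule below.

import re

def sln_format_version(path):
    # The header line may be preceded by a BOM and/or a blank line,
    # so scan the first few lines rather than assuming line 1.
    with open(path, "r", encoding="utf-8-sig") as f:
        for line in (f.readline() for _ in range(4)):
            m = re.match(r"Microsoft Visual Studio Solution File, "
                         r"Format Version (\d+\.\d+)", line)
            if m:
                return m.group(1)   # e.g. "12.00" for Visual Studio 2012
    return None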

sln = ver-header entries
entries = (project | global)+

ver-header = visual-studio ", " format-version "\n" comment-version "\n"
visual-studio = "Microsoft Visual Studio Solution File"
format-version = "Format Version numeric-version
numeric-version = nv_2002 | nv_2003 | nv_2005 | nv_2008 | nv_2010 | nv_2012
nv_2002 = "7.00"
nv_2003 = "8.00"
nv_2005 = "9.00"
nv_2008 = "10.00"
nv_2010 = "11.00"
nv_2012 = "12.00"
comment-version = "# Visual Studio " year-version
year_version = "2005" | "2008" | "2010" | "2012"

project = project-id "\n" project-parms* "EndProject" "\n"
project-id = "Project(" '"{' pt-guid '}"' )" = " proj_parms
proj_parms = proj-name ", " proj-path ", " proj_guid
pt-guid = vcppguid | vbguid | vcsharpguid | webguid | siguid
vcppguid = "8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942"
vbguid = "F184B08F-C81C-45F6-A57F-5ABD9991F28F"
vcsharpguid = "FAE04EC0-301F-11D3-BF4B-00C04F79EFBC"
webguid = "E24C65DC-7377-472b-9ABA-BC803B73C61A"
siguid = "2150E333-8FDC-42A3-9474-1A3956D46DE8"
proj_name = '"' STRING '"'
proj_path = '"' STRING '"'
proj_guid = HEX{8} "-" HEX{4} "-" HEX{4} "-" HEX{4} "-" HEX{12}

global = "Global" "\n" global-section* "EndGlobal" "\n"
global_section = gs_start "\n" gs_body "EndGlobalSection" "\n"
gs_start = "GlobalSection" "(" gs_kind ")" " = " gs_when
gs_kind =   "SolutionConfigurationPlatforms"
          | "ProjectConfigurationPlatform"
          | "SolutionProperties"
          | "ExtensibilityGlobals"
          | "ExtensibilityAddIns"
gs_kind_05 =   "SolutionConfiguration"
             | "ProjectConfiguration"
             | "ProjectDependencies"
gs_when = "preSolution" | "postSolution"
gs_body = gs_configs | gs_plats | gs_props
gs_configs = (gs_config "\n")*
gs_config = sol_cfg_plat " = " sol_cfg_plat
sol_cfg_plat = sol_cfg "|" sol_plat
sol_cfg = "Debug" | "Release" | STRING
sol_plat = "Win32" | "x64" | STRING
gs_plats = (gs_plat "\n")*
gs_plat = "{" proj_guid "}" "." sol_cfg_plat "." gs_action " = " sol_cfg_plat
gs_action = "ActiveCfg | "Build.0"
gs_props = "HideSolutionNode = " ("TRUE" | "FALSE") "\n"

This is horribly incomplete and horribly informal. I need to work more on the individual elements and write the grammar fragments for them. The above would parse a subset of solution files, but it would also generate many illegal solution files. Also, there are some elements that are deprecated and probably converted to newer elements. I noted a few of those in the gs_kind_05 rule, for elements that stopped existing at some point after VS 2005.

This is what a trivial wizard-generated solution file with one project looks like in Visual Studio 2012. I have split lines in some cases using a C-preprocessor ‘\’ syntax, in order to avoid line wrapping, and I made short GUIDs for the same reason. Visual Studio probably needs the exact Microsoft GUID format – a 128-bit GUID split into 5 chunks of 32-16-16-16-48 bits, respectively. For completeness’ sake, I turned 8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942 into 8BC9-8D11-00A0C91B, and 67DC3F40-5C82-445C-8EE9-0B78724D839D into 67DC-8EE9-0B78724D.

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 2012
Project("{8BC9-8D11-00A0C91B}") = "vc12-test-1-empty", \
        "vc12-test-1-empty\vc12-test-1-empty.vcxproj", \
        "{67DC-8EE9-0B78724D}"
EndProject
Global
    GlobalSection(SolutionConfigurationPlatforms) = preSolution
        Debug|Win32 = Debug|Win32
        Release|Win32 = Release|Win32
    EndGlobalSection
    GlobalSection(ProjectConfigurationPlatforms) = postSolution
        {67DC-8EE9-0B78724D}.Debug|Win32.ActiveCfg = Debug|Win32
        {67DC-8EE9-0B78724D}.Debug|Win32.Build.0 = Debug|Win32
        {67DC-8EE9-0B78724D}.Release|Win32.ActiveCfg = Release|Win32
        {67DC-8EE9-0B78724D}.Release|Win32.Build.0 = Release|Win32
    EndGlobalSection
    GlobalSection(SolutionProperties) = preSolution
        HideSolutionNode = FALSE
    EndGlobalSection
EndGlobal
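
As a cross-check on the project-id rule, here is a rough Python sketch (names are mine, and this ignores everything but the Project lines) that pulls the three proj_parms fields out of a solution file with a regular expression. Real solution files keep each Project line on a single line; the ‘\’ continuations in the sample above were only to avoid wrapping.

import re

# Project("{<type guid>}") = "<name>", "<path>", "{<project guid>}"
PROJECT_RE = re.compile(
    r'Project\("\{(?P<type_guid>[0-9A-Fa-f-]+)\}"\)\s*=\s*'
    r'"(?P<name>[^"]*)",\s*"(?P<path>[^"]*)",\s*"\{(?P<guid>[0-9A-Fa-f-]+)\}"')

def sln_projects(path):
    with open(path, "r", encoding="utf-8-sig") as f:
        return [m.groupdict()
                for m in (PROJECT_RE.match(line.strip()) for line in f)
                if m]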

Some notes

For SolutionConfigurationPlatforms, I can’t quite figure out what’s going on. I’m guessing that one side is a label that’s used elsewhere, and the other side is the actual name of a config. I’m guessing that there’s some unification going on here for projects that have different subsets of configurations? I tried editing one side or the other, but Visual Studio must be “repairing damage”, because it puts things back the way it wants them.

For Project, I can’t figure out what the proj-name item is actually for. I don’t know what uses it, because it doesn’t control what’s displayed in Visual Studio (the name from the vcxproj file itself is used).

The values for GlobalSection are actually in the registry, in a key named SolutionPersistence for the relevant version of Visual Studio. Each entry just has a GUID, which in turn probably points to a DLL that handles that value? There are 20 of these in my installs of VS 2010 and VS 2012, and even the deprecated VS 2005 variants are listed, so maybe they aren’t deprecated?

Visual Studio 2002, which is “Format Version 7.00”, and Visual Studio 2003, which is “Format Version 8.00”, did not have the second comment line with the Visual Studio year entry.

Project dependencies are in a ProjectSection inside a Project entry, like so:

Project("{8BC9-8D11-00A0C91B}") = "Tool", "Tool.vcproj", "{0C1B-BCF4-01F038A7}"
    ProjectSection(ProjectDependencies) = postProject
        {6678-A354-3A61F442} = {6678-A354-3A61F442}
    EndProjectSection
EndProject

The dependencies are always “project depends on itself”, instead of “B depends on A”. It’s possible that there’s some other dependency allowed?

Reference

Hack the Project and Solution Files

Premake generates sln files, but in an ad-hoc fashion. It doesn’t read them.

Solution (.sln) File from MSDN – old, it’s for VS 2005, but still mostly relevant.

Grammars for arbitrary bitstrings

Note: this was an older direction that I might return to.

Grammars are defined on strings, but those “strings” are usually drawn from a small subset of all possible strings; e.g. a string of “human-readable characters”. This bias is due to the fact that grammars and parsing started with languages like English and Fortran. Instead, what we want is a way to write a precise grammar for any language, not just languages expressed as human-readable characters. And we want grammars because they can be used both as recognizers and as generators, and can potentially recognize/parse files in linear time.

To avoid encoding shortcuts (and thus limiting the expressiveness of our grammars), we want to define grammars on bitstrings. Thus, we only need two terminals, 0 and 1. We’ll build up some common primitives, but they are for convenience only.

However, it’s easy to define grammars as being on bitstrings, albeit a little more cumbersome. Our alphabet is just the two 1-bit values 0 and 1. We’re going to start with the syntax of ABNF, as clumsy as that is (although we’ll abandon it soon, since we’ll want the freedom to declare what a character in a string actually means). We might define a 1-byte value and a 4-byte value like this:

OCTET = 8*8(0|1)
DWORD = 4*4OCTET

Note that in neither definition do we require that an OCTET or DWORD start on an 8-bit boundary. Padding has to be explicit in our grammars.

But right off the bat, we have issues with bit-endianness and byte-endianness once we want to interpret values (which we need to do in order to make an actual language). Yes, an OCTET is 8 consecutive bits from the string, but if you want to reason about it as a value in the range 0..255, you need to decide whether the extracted bitstring is big-endian (most significant bit is leftmost in the bitstring) or little-endian. And likewise, a DWORD is 4 consecutive OCTETs, but are those organized in a big-endian or a little-endian fashion?
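
To make the two readings concrete, here is a quick Python illustration (not tied to any grammar machinery) of the same four octets interpreted both ways:

data = bytes([0x49, 0x49, 0x2a, 0x00])

int.from_bytes(data, "big")     # 0x49492a00 = 1229531648
int.from_bytes(data, "little")  # 0x002a4949 = 2771273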

We could tediously construct our values (with semantic actions)

b0 = (0|1)
b1 = (0|1)
b2 = (0|1)
b3 = (0|1)
b4 = (0|1)
b5 = (0|1)
b6 = (0|1)
b7 = (0|1)
OCTET_BE = b7 b6 b5 b4 b3 b2 b1 b0 { $$ = ($1 << 7) | ... | $8; }

Or something like that. It’s likely that we will quickly fall back on some predefined primitives, but it’s nice to know that we can define all the primitives ourselves. We might even want to steal the “little” and “big” prefixes from DataScript, so we might simply have

. = (0|1)
OCTET = big ........

and then the nonterminal’s value is interpreted as a big-endian bitstring.
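
In plain code, the conversion is just a fold over the matched bits; a small Python sketch of both orderings, to pin down what “big” and “little” would mean here:

def bits_to_int_big(bits):
    # most significant bit first, e.g. [0,1,0,0,1,0,0,1] -> 0x49
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def bits_to_int_little(bits):
    # least significant bit first, e.g. [1,0,0,1,0,0,1,0] -> 0x49
    return sum(b << i for i, b in enumerate(bits))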

Another challenge is that grammars involve a left-to-right parse, but many files’ interpretation involves moving about in the bitstring. A sequence of bits early on might be an offset to a different section in the bitstring that needs to be understood before more bits can be parsed.

Appendix

Why not ASN.1? Because it’s not suitable. It’s complex, and yet incomplete. It separates abstract notation from encoding, which we don’t want (we want to keep them together), and it can’t handle random-access formats.

Why not BFF (“A Grammar for Binary File Formats”)? It limits itself to a bytestring, not a bitstring.

Something intriguing: Synalyze It! (sigh, people who put exclamation marks in names). In looking at the “grammar” for ZIP files, it falls back to a fair amount of Lua. It’s really a decompiler, not a grammar; I’m not seeing how this can be used to generate strings (e.g. legal ZIP files). But still, it’s interesting.

Another interesting commercial candidate is 010 Editor, with the brag “Edit Anything”. It’s an interesting approach: use C structs with embedded semantic actions. Again, not a grammar, not generative. It’s really a specialized language for parsing binary data.

Links

Parsing of Context Sensitive Grammars

Earley Parsing for Context-Sensitive Grammars

Sample Parsers from Parsing Techniques

The Convergence of Mildly Context-Sensitive Grammar Formalisms

Genetic programming with context-sensitive grammars

A new method for dependent parsing (Yakker)

Van Wijngaarden grammar

xtc (eXTensible Compiler) project.

PacketTypes: Abstract Specification of Network Protocol Messages


TIFF 6.0 binary file grammar

This is the entire reference grammar, referred to by other posts. Or rather, this will be the entire grammar; I will keep updating it until it is complete.

grammar tiff "TIFF 6.0 1992-06-03"

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)

ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 4} elems data

type = (type_1B | type_2B | type_4B | type_8B | type_U)
type_1B = (1 | 2 | 6 | 7) {_esize=1}
type_2B = (3 | 8) {_esize=2}
type_4B = (4 | 9 | 11) {_esize=4}
type_8B = (5 | 10 | 12) {_esize=8}
type_U = (0|13|..|65535) {_esize=0}

data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

tag = (Artist | BitsPerSample | CellLength | CellWidth
  | ColorMap | Compression | Copyright | DateTime | DocumentName
  | DotRange | ExtraSamples | FillOrder | FreeByteCounts
  | FreeOffsets | GrayResponseCurve | GrayResponseUnit
  | HalftoneHints | HostComputer | ImageDescription | ImageLength
  | ImageWidth | InkNames | InkSet | JPEGACTables | JPEGDCTables
  | JPEGInterchangeFormat | JPEGInterchangeFormatLength
  | JPEGLosslessPredictors | JPEGPointTransforms | JPEGProc
  | JPEGQTables | JPEGRestartInterval | YCbCrSubSampling | Make
  | MaxSampleValue | MinSampleValue | Model | NewSubfileType
  | NumberOfInks | Orientation | PageName | PageNumber
  | PhotometricInterpretation | PlanarConfiguration | Predictor
  | PrimaryChromaticities | ReferenceBlackWhite | ResolutionUnit
  | RowsPerStrip | SMaxSampleValue | SMinSampleValue
  | SampleFormat | SamplesPerPixel | Software | StripByteCounts
  | StripOffsets | SubfileType | T4Options | T6Options
  | TargetPrinter | Thresholding | TileByteCounts | TileLength
  | TileOffsets | TileWidth | TransferFunction | TransferRange
  | WhitePoint | XPosition | XResolution | YPosition
  | YResolution | YCbCrCoefficients | YCbCrPositioning)

NewSubfileType = 254
SubfileType = 255
ImageWidth = 256
ImageLength = 257
BitsPerSample = 258
Compression = 259
Compression.Uncompressed = 1
Compression.CCITT_1D  = 2
Compression.Group_3_Fax = 3
Compression.Group_4_Fax = 4
Compression.LZW = 5
Compression.JPEG = 6
Compression.PackBits = 32773
PhotometricInterpretation = 262
PhotometricInterpretation.WhiteIsZero = 0
PhotometricInterpretation.BlackIsZero = 1
PhotometricInterpretation.RGB = 2
PhotometricInterpretation.RGB_Palette = 3
PhotometricInterpretation.Transparency_mask = 4
PhotometricInterpretation.CMYK = 5
PhotometricInterpretation.YCbCr = 6
PhotometricInterpretation.CIELab = 8
Thresholding = 263
CellWidth = 264
CellLength = 265
FillOrder = 266
DocumentName = 269
ImageDescription = 270
Make = 271
Model = 272
StripOffsets = 273
Orientation = 274
SamplesPerPixel = 277
RowsPerStrip = 278
StripByteCounts = 279
MinSampleValue = 280
MaxSampleValue = 281
XResolution = 282
YResolution = 283
PlanarConfiguration = 284
PageName = 285
XPosition = 286
YPosition = 287
FreeOffsets = 288
FreeByteCounts = 289
GrayResponseUnit = 290
GrayResponseCurve = 291
T4Options = 292
T6Options = 293
ResolutionUnit = 296
PageNumber = 297
TransferFunction = 301
Software = 305
DateTime = 306
Artist = 315
HostComputer = 316
Predictor = 317
WhitePoint = 318
PrimaryChromaticities = 319
ColorMap = 320
HalftoneHints = 321
TileWidth = 322
TileLength = 323
TileOffsets = 324
TileByteCounts = 325
InkSet = 332
InkNames = 333
NumberOfInks = 334
DotRange = 336
TargetPrinter = 337
ExtraSamples = 338
SampleFormat = 339
SMinSampleValue = 340
SMaxSampleValue = 341
TransferRange = 342
JPEGProc = 512
JPEGInterchangeFormat = 513
JPEGInterchangeFormatLength = 514
JPEGRestartInterval = 515
JPEGLosslessPredictors = 517
JPEGPointTransforms = 518
JPEGQTables = 519
JPEGDCTables = 520
JPEGACTables = 521
YCbCrCoefficients = 529
YCbCrSubSampling = 530
YCbCrPositioning = 531
ReferenceBlackWhite = 532
Copyright = 33432

Making a grammar for TIFF, part 1

I’ve been trying to come up with actual grammars for arbitrary binary file formats. The traditional ways are either “ad hoc” or “map it to a struct”. Neither works out well in practice; that it works at all is more a testament to hard work and tolerance of quirks than it is to correctness and usability.

I’m going to develop a grammar for the TIFF file format. I’m going off of this version of TIFF: TIFF Revision 6.0 Final June 3, 1992. The full grammar I’m working on can be found in this other post: TIFF 6.0 binary file grammar.

TIFF file grammar round 1

Here’s the grammar at a high level

grammar tiff "TIFF 6.0 1992-06-03"

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 4} elems data
data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

Text in braces like {endian tiff.bom} is akin to a semantic action; it is not part of the syntax, it either has semantic meaning or alters where or how data is parsed. We will think of these as semantic actions, because that’s how you’d do this kind of thing in a more classic parser. However, I’m going to define the syntax for semantic actions as opposed to leaving it as “write in some language like C”, because I want the semantic actions to be portable to any target, and my semantic actions have lots of magic in them.

Endianness

tiff = tiff.header {endian tiff.bom} tiff.data
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

One valid way to look at binary data is as a string of bits. We could do this, but writing grammars like this would be impractically large. Instead, we want to work with data at a higher granularity, and that means we need to assign meaning, and the very first meaning has to do with how we arrange bits and bytes into values. Setting aside actual bit-level grammars for now and focusing on byte-level grammars, if we have a number larger than the range 0..255 (or -128..+127), we need to store it in more than one byte. That means that we need to set a byte order that is consistent between readers and writers of that number.

The TIFF format does not have a predetermined byte order. Instead, the byte order is recorded in an entry in the TIFF data header. In TIFF, the very first two bytes of the TIFF data must be

0x49 0x49

or

0x4d 0x4d

to mean little-endian or big-endian, respectively. I’m using the fairly common “0x” prefix to indicate values written in base 16, also known as hexadecimal, and I’m going to write one number for each 8-bit byte.

It turns out that in ASCII (and UTF-8, and Windows Latin-1, and many other character sets), 0x49 0x49 is “II”, and 0x4d 0x4d is “MM”. This was Adobe’s self-referential way of saying that “II” means that data is stored in the file in little-endian fashion, because the “I” refers to Intel, and that “MM” means big-endian, because “M” refers to Motorola. At the time of the spec, these were the two remaining giants who’d picked opposing endianness for their CPUs when processing data longer than 1 byte in length. For a brief moment in time, it looked like we could assume little-endian, but then ARM came along to challenge Intel, and now we have both again in credible amounts.

So there are two grammars for the rest of the file – one where all the multi-byte values are stored in little-endian fashion, and another where all the multi-byte values are stored in big-endian fashion. We have to either represent this directly in the grammar, or use some sort of short-cut. And, there are other file formats where the endianness can change within the file (for example, in EXIF, which used TIFF as a starting point, you can end up with a file where some data sections have little-endian data, and other data sections have big-endian data).

So I want to introduce a higher-level action that says “for this subset of the data” (and subset could be “entire file”), all multiple-byte values are big endian. My sketch above has some magic in it that I will write out more fully. Suffice it to say that {endian tiff.bom} says “set the endian of the data from this point in the sentence of the grammar to the endianness of the bom”.

Now, I need to set the endianness of the bom itself, and it might look something like this

tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

so if I match the left side of the alternation, I am saying tiff.bom has the little-endian property, and if I match the right side, I’m saying that tiff.bom has the big-endian property. This would be extracted as seen above. At this point, little and big are magic predefined entities, and it’s likely that I’ll want to call that out with some syntax different from the (user-defined) terminals and non-terminals.

Semantically, I want endianness to be “block-oriented”, so endianness is specific to the rule (and sub-rules) that it is set in. That’s why you see {endian tiff.bom} repeated several times.
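
In hand-written code (a Python sketch, not generated from the grammar), the bom rule and the {endian …} action collapse into something like this, where the return value is the byte-order prefix used for all later multi-byte reads:

def read_bom(f):
    bom = f.read(2)
    if bom == b"II":
        return "<"      # struct's little-endian prefix
    if bom == b"MM":
        return ">"      # struct's big-endian prefix
    raise ValueError("not TIFF data: bad byte-order mark")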

An aside on verbosity

It’s going to seem like this is unnecessarily verbose. It’s the opposite: I expect to end up with a full TIFF grammar that’s a fraction of the size of the spec (dozens of the 121 pages are devoted to describing the TIFF format), and the grammar will be precise, whereas the spec is not. More importantly, if you wanted TIFF load/save code for a new processor or a new language, it’s going to take you days to read the spec and turn it into working code. The point of having a grammar is that you just run the grammar through a new back-end and get code for your new language or platform.

Or imagine doing this on the fly – having a grammar file that is loaded by a program, thus giving that program (within limits) the ability to manipulate that file. And maybe that’s the way revisions of file formats are handled.

Also, you can compose grammars, and that is a key to why I am doing this in the first place.

Size of data

In text-based grammars, we assume there is a logical/abstract “character” and we let the lower-level code worry about how that maps to a physical stream of bytes/bits. In our binary grammars, we have to be explicit.

Looking at the second line in the grammar above, we see a {bytes 2} semantic action. That means that the next item is a 2-byte value. So this

tiff.header = {bytes 2} tiff.bom {bytes 2} tiff.version
tiff.version = {endian tiff.bom} 42
tiff.bom = (0x49 0x49 {endian little} | 0x4d 0x4d {endian big})

matches a two-byte bom, and a two-byte explicit decimal 42 in the endianness determined by the bom. This is both a long-winded and yet short way of saying that TIFF data starts with one of these two 4-byte sequences

49 49 2a 00
4d 4d 00 2a
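
Continuing the Python sketch from the endianness section, reading those four bytes looks like this; the version is read in whatever byte order the bom declared:

import struct

def read_tiff_header(f):
    endian = read_bom(f)                    # "<" or ">" from the earlier sketch
    (version,) = struct.unpack(endian + "H", f.read(2))
    if version != 42:
        raise ValueError("not TIFF data: version is not 42")
    return endian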

The grammar approach is more verbose, and yet it’s not. First, there’s no meaning attributed to listing out sequences of bytes, so I would still need some rule to say “the first line means that the rest of the file has little-endian data”. Second, I would have lost the ability to extend. Third, I want to generate file parsers automatically from my grammar, for any machine and target language.

We have tiff.version as a non-terminal rather than as a hard-coded 42 for two reasons: it calls the version out as a semantically meaningful entity, and it leaves room for some future TIFF revision 6.x (or 7) to use a higher version number, for example to allow for creating TIFF images bigger than 4 GB. In fact, this has been proposed, and it’s called BigTIFF.

Also note that the {bytes 2} action for bom is not really necessary, because I explicitly wrote out the bom pattern as two bytes. This was a stylistic choice aiding the reader of the grammar. I have left out {bytes 1} markers on 0x49 and 0x4d, because currently we aren’t doing bit-level grammars. These would need to be introduced once that happens, or we need some grammar-level instruction that says “this grammar has byte alignment”.

As an aside, I do intend to handle even bit-level binary files, say for example ZIP or RAR containers. This can be done with a simple extension, {bits 6} for example. But that means I really need to figure out how to declare literals, since literal size needs to be explicit somehow. I want to write bit strings as well as hex strings.

Random access

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)
ifd = {bytes 2} entries {sort ascending tag} entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 4} elems data

One big difference between this and a grammar for say C++ is that many binary file formats are random-access. You don’t read the file in sequential order starting with the first byte; instead, some bytes are addresses that point to other parts of the file. And at least equally important, the meaning of the file generally depends on the data in the file, so you can’t even parse it in a rudimentary fashion.

Take the common “catalog chunk” used in many binary file formats. The catalog chunk has information describing where data is, and also how to interpret it. You have to read and parse the catalog chunk first, even if the catalog is at the end of the file.

In the case of the TIFF format, there are start-relative pointers to data. Right after the four bytes of the bom and version, there is a 4-byte offset pointing to the start of the first IFD, or Image File Directory. TIFF data can have one or more IFDs in it, and each one describes an image. At the end of each IFD is an offset to the next IFD, and if the offset is 0, this is taken to be a terminator; no more IFDs are present in the file.

We can describe it as so

tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0

There are zero or more pairs of offset and IFD, followed by a 4-byte 0. The layout is that there is a 4-byte offset that points to where the IFD is; in the line above, the offset and the IFD are not necessarily contiguous in memory. Instead, we introduce a semantic action to move the read cursor to offset (and later, we’ll talk about what this is an offset from).

Note that offset is a 4-byte integer stored with a specific endianness, but we decided that endianness is global for now, and we had set it after parsing the bom.
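
Written out by hand, the offset/seek pair and the trailing 4-byte 0 become an ordinary loop. A Python sketch, assuming the cursor sits just past the 4-byte header and that IFD entries are the 12-byte records described in the next section:

import struct

def ifd_offsets(f, endian):
    # offset -> ifd -> next offset, until a 4-byte 0 terminates the chain
    offsets = []
    (offset,) = struct.unpack(endian + "I", f.read(4))
    while offset != 0:
        offsets.append(offset)
        f.seek(offset)
        (count,) = struct.unpack(endian + "H", f.read(2))
        f.seek(offset + 2 + 12 * count)     # skip the 12-byte entries
        (offset,) = struct.unpack(endian + "I", f.read(4))
    return offsets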

It’s possible that I’ll decide to introduce symbols for seek and endianness (and as we will get to later, we’ll see that endianness is simply a special case of the more encompassing idea of formatting data). I’ll stick with the function-name approach for now, as it is slightly more readable as we’re reasoning things out. The main thing to take away is that we have to be able to move about the data, either skipping forwards, or rewinding backwards.

Something that might stand out as a warning sign is “how do we know our data doesn’t have holes or overlaps?” This is a good question. The former just means waste, whereas the latter is almost certainly a problem. Now, aliasing is different, and that just means re-use, which isn’t a problem in and of itself. The answer here is that we have nothing for now; that’s a lint-like facility that we could use after parsing or generating data. Maybe there’s a way to bake it into the grammar itself; but if we did bake it into the grammar, we would reduce the universality of the grammar. Even without so reducing, the fact that we have a grammar means that we can have a single automated tool that can detect overlaps or holes for any grammar on any data.

There is a problem with the definition as written, and it is that offset cannot be zero, nor can it be odd – all data in a TIFF file is meant to be aligned on word (two-byte) boundaries, probably as a nod to efficiency. We’ve also been skating over the matching of the terminals; offset isn’t a terminal, it’s a non-terminal.

tiff = tiff.header tiff.data
tiff.data = ({bytes 4} offset {seek offset} ifd)* {bytes 4} 0
offset = (1|..|2^32-1)

And this has reminded us of another data formatting issue other than endianness, and that is that of signed versus unsigned numbers. Binary file formats contain not just 1, 2, 4 or 8-byte numbers (or beyond); they contain signed numbers, unsigned numbers, floating point, rationals, enumerations, strings of many different character sets, arrays, and more. It’s not unheard of to have specially constructed data types that don’t correspond to standard things.

The above definition side-stepped the idea of unsigned. It simply stated that offset is a 4-byte number that is in the range 1 to 2^32-1. In practice, this exactly maps to the idea of unsigned. However, I imagine we will need something more explicit than that.

And we’ll get to alignment at some point.

Dynamic data

Now we get to the hard part, or at least slightly more complicated than we’ve been doing to date.

An IFD (Image File Directory) is a structure consisting of a two-byte count of the number of directory entries, and then a sequence of that many directory entries, where each entry is a 12-byte composite of tag, type, count, and value or offset to value.

ifd = {bytes 2} entries entry[entries]
entry = {bytes 2} tag {bytes 2} type {bytes 4} elems data

As you can see, I’m using a typical array notation to indicate “repeat N times”. As a simple definition, this covers the image directory itself, but doesn’t quite handle the values.

The value of elems is not the number of bytes in the value, it’s the number of elements of the basic type. The type field specifies the basic type, and the data is an array of elems items of that type. For example, if type was 3, this would be “unsigned short”, a 2-byte datum, and if elems was 4, this would mean 8 bytes of data, comprising a list of 4 unsigned shorts.

The second complication is that offset is only an offset for data that is longer than 4 bytes. Data 4 bytes or less in length is packed into the offset field, left-justified. It’s still subject to the endianness declared in the header, however.

TIFF 6.0 has 12 types: signed and unsigned 1-byte, 2-byte, and 4-byte integers, signed and unsigned 8-byte rationals, ASCII zero-terminated string (actually this can be multiple strings, each zero-terminated), 4-byte float, 8-byte float, and an “undefined” type which is really 8-bit octets that have no predetermined data type. The pedagogy of the latter is interesting, since in practice there is no difference between the “octet” type (a better name than undefined) and the unsigned 1-byte integer.

So an individual element’s size is determined by the type field, in a somewhat arbitrary way. Maybe we could express it like this

type = (type_1B | type_2B | type_4B | type_8B | type_U)
type_1B = (1 | 2 | 6 | 7) {_esize=1}
type_2B = (3 | 8) {_esize=2}
type_4B = (4 | 9 | 11) {_esize=4}
type_8B = (5 | 10 | 12) {_esize=8}
type_U = (0|13|..|65535) {_esize=0}

We have a type_U because the TIFF spec states “readers should skip over fields containing an unexpected field type”. Alternatively, we could declare this to be an error. If I had my druthers, I would declare it an error, but write my parser to recover from errors.

Then we could use our semantic _esize value to calculate the size of the data, and then use that when matching to decide if we have embedded data or an offset to the data.

data = data.small | data.big
data.small = {?_esize <= 4} {bytes _esize} . {bytes 4-_esize} .
data.big = {bytes 4} offset {seek offset} {bytes _esize} .

Our data is either embedded in the IFD entry, or pointed to by the offset contained in the entry. I need some sort of predicate to say “match here if the computed size is 4 bytes or less”, and in reality I’ll use something from one of the functional languages. I use “.” here to mean “match anything”, but it’s inconspicuous enough that I’ll probably switch to something else.

There is a problem here, and that is the issue of random access and the file cursor. I haven’t said anything about localizing the cursor, but I am moving it in one case and not the other. In either event, I need the cursor change to be temporary when it fetches the data from the offset, because the cursor needs to be left pointing to the start of the next record. I think it’s enough to say that cursor seeks are localized to the rule they are operating in, so at the end of the rule, the cursor is restored to where it would be if the seek hadn’t happened. I’ll need to write out some cursor actions to make sure that is the case.
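
Here is what that looks like as hand-written Python (a sketch; the _esize table comes straight from the type_* rules, and the tell/seek pair is the explicit version of the localized cursor just described):

import struct

ESIZE = {1: 1, 2: 1, 6: 1, 7: 1,    # type_1B
         3: 2, 8: 2,                # type_2B
         4: 4, 9: 4, 11: 4,         # type_4B
         5: 8, 10: 8, 12: 8}        # type_8B; anything else is type_U, size 0

def read_entry(f, endian):
    tag, typ, elems = struct.unpack(endian + "HHI", f.read(8))
    size = ESIZE.get(typ, 0) * elems
    if size <= 4:                           # data.small: packed into the 4-byte field
        raw = f.read(4)[:size]
    else:                                   # data.big: 4-byte offset to the data
        (offset,) = struct.unpack(endian + "I", f.read(4))
        here = f.tell()                     # save the cursor...
        f.seek(offset)
        raw = f.read(size)
        f.seek(here)                        # ...and restore it for the next entry
    return tag, typ, elems, raw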

Alternatively, we could introduce push and pop operations, although I’m less thrilled about that.

Enumerations

We’re almost done, but the last part is a lot of typing. The tag field in the IFD entry is the field’s name, which is also its meaning. There are 74 defined tags in the TIFF file format. While each field has a low-level type format, that’s only for storage; each tag is free to have a higher-level meaning for its data. For example, ImageWidth and ImageLength store the width and height of the image in pixels, and you can’t process the TIFF file without reading and understanding these. Compression is actually an enum of 7 different compression types.

In principle, it’s straightforward

tag = (NewSubfileType | SubfileType | .. | Copyright)
NewSubfileType = 254 { stuff here }
SubfileType = 255 { stuff here }
Copyright = 33432 { stuff here }

You would do something with the data at this point, if you were generating a parser from the grammar. For TIFF data, you’d find the RowsPerStrip, StripOffsets and StripByteCounts fields, which point to the actual compressed data for strips of pixels, along with other important tags to help you render the data properly.
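
In generated code, the tag rules mostly collapse into a lookup table; a tiny Python excerpt, with the numbers taken from the reference grammar above:

TAG_NAMES = {
    254:   "NewSubfileType",
    256:   "ImageWidth",
    257:   "ImageLength",
    259:   "Compression",
    273:   "StripOffsets",
    278:   "RowsPerStrip",
    279:   "StripByteCounts",
    33432: "Copyright",
    # ... the remaining tags from the reference grammar go here
}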

This understates the work in this section, because the TIFF file format subsumes the format of TIFF image data, which also needs to be described. However, there are no more new elements needed to describe TIFF data, once you have predicates, sizes and endianness.

There are several thousand EXIF tags, by the way. Still, it’s just typing, and it’s far better to have everything explicit, and documented.

Constraints

There are some constraints we haven’t put in yet. The entries in an IFD are meant to be sorted by tag number, in ascending order. Presumably the thought here was to allow binary search, although the number of entries is rarely large enough that binary search would make sense.

Another constraint is that you can only have unique tags in an IFD. I don’t see this written into the spec anywhere, but I’m pretty sure it would have to be this way, or you would not be able to read or parse the entire file.

Another constraint is that valid files require certain tags.

Another constraint is that, if there are multiple images in the TIFF, the first one must be the high-resolution image. Subsequent ones are generally reduced-resolution, say for thumbnails.

The TIFF spec disallows aliasing of data; all data can be referenced from one place only.
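
The first two constraints (sorted ascending, no duplicate tags) are easy to check after the fact; a one-function Python lint, assuming tags is the list of tag numbers from one IFD in file order:

def ifd_tags_well_formed(tags):
    # strictly ascending implies both sorted and unique
    return all(a < b for a, b in zip(tags, tags[1:]))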

Partial file parsing

After all of this, I want to return to one of the implications of why binary file formats usually use random access: many binary files are designed so that you don’t need to read the entire file. For example, if you have a texture stored as a set of mipmaps, you only want one of the mipmaps; reading (or parsing) the other ones would be a waste of time.

Having a recognizer generated from a grammar does not mean that the recognizer has to parse everything. The onus on us, however, is to make sure that the grammars are written in a way that recognizers can take advantage of in skipping over data that is not needed.


A litany of Exif woes

The Exif file format is, shall we say, not robust and not documented well. And yet it’s at the heart of hundreds of billions of data items – the photos that people take with digital cameras in ever-increasing numbers.

Strictly speaking, Exif data is not needed; it’s metadata about your image, not the image itself. And yet, sans that metadata, many photographers and their tools are worse off – the metadata stores timestamps, GPS locations, camera information, and more.

Where things go wrong is in importing or editing. For example, Apple’s iPhoto adds its own metadata when importing photos from cameras; so do most other importing tools. Photoshop needs to copy or remove metadata when you modify a photo. Social media sites are currently in a tiny bit of hot water for actively removing metadata from photos uploaded to their sites.

Changing byte order

One of the worst things you can do is to change the byte-order in Exif metadata. Or at least, this is one that is fraught with peril, because there are a lot of specialized formats buried inside Exif, and if you fix up some but not all of the offsets, you’ll break metadata; not for you, because you’re skipping that bit for failing to understand it, but for some other tool more savvy than yours (or knowledgeable in a different way).

For example, I had taken some photos with a Nikon E990 in 2007. In looking at them recently (while trying to develop a more comprehensive JPEG file parser), I found that iTunes had decided to change an Exif chunk from little-endian to big-endian. It fixed up all the data it knew about, but evidently this older version of iTunes did not know about the Nikon MakerNote data. And so it left the MakerNote data alone, which meant that its offsets were still little-endian. Photos I’d imported previously with Nikon’s custom software were fine, because Nikon of course knows how to handle its own metadata.

I don’t actually know if the Nikon E990 writes Exif as little-endian or big-endian, because I only have photos that were imported with a tool, not copied directly from its media. That is an interesting experiment I need to run at some point.

Maker Note – no specific format

The Maker Note data itself is very problematic, because there is no one format for it. About half the cameras in existence seem to write their Maker Note data in IFD format, and half have a custom format. This makes it very hard to do anything with the whole file, because MakerNote data offsets are absolute offsets, not relative to the MakerNote chunk, and there is actually nothing forcing Maker Note data to be stored inside the declared MakerNote chunk (although one suspects it usually is). So if you edit an Exif file in a way that causes the location of the MakerNote data to shift, there is a high likelihood of breaking the MakerNote data, since its embedded offsets now point to the wrong place. Microsoft attempted to fix this, but their fix was useless.

What other people have said

Problems with current Metadata Standards. Phil Harvey is the author of ExifTool, the most comprehensive (if a bit too ad-hoc) Exif-type metadata tool around. Also, the ExifTool FAQ has tidbits of information in it about Exif-format issues.

MakerNote formats and specifications.

EXIF Maker Notes.

TIFF, Tag Image File Format, FAQ from Aware Systems has the defining line about the TIFF file format (which Exif piggy-backed onto).

Except for one more thing: the TIFF specification explicitly forbids tag data to include any offsets into the file or into any other TIFF data block, except for the documented special cases (TileOffsets, StripOffsets,…). This enables total relocatability of all structures and data blocks. That is a major corner stone of the format. It means it is actually easy to, e.g. unlink an IFD or concatenate multiple single-page TIFF to a single multi-page TIFF, or vice versa. If any tag’s data could contain offsets pointing anywhere in the file, then software doing this or otherwise relocating data blocks should be aware of the exact nature of every tag’s data in order to find all data blocks, and know what pointers ought to be changed. That is unfeasible, on the one hand, due to the number of defined tags and, on the other hand, it inhibits extendability and private tags.


Instead, the specification says that all tag data needs to be ‘self-contained’, and that only a selected few special tags are allowed to point to other locations in the file. Thus, all blocks become freely relocatable, can be read and written out in any order, and any software can quite simply joggle around all this TIFF data, with only inbuilt knowledge of these highest level structures, and of the selected few special tags.

If Exif had followed this, then we wouldn’t be having the problems currently faced. I presume that the original spec authors were thinking of Exif files as write-once magic that would not be edited (very short-sighted). On the other hand, TIFF files are a flat structure, basically an array of independent IFD elements. Sometimes structure is desired.

The (much later) suggestion was to have an IFD TIFF type that pointed to a self-contained IFD data block, which would basically be its own embedded TIFF file (with its own byte-orientation, and block-relative offsets).

The actual answer is a new file format that is backwards-compatible and forwards-compatible. This is a pretty daunting thing to imagine, but given the ever-increasing number of Exif files that are generated each year, it’s something that we should do.

Nikon MakerNote Tags


Using Rewriting Logic to specify program semantics

An Executable Semantic Definition of the Beta Language using Rewriting Logic.

Rewriting logic specifications are an attempt to provide easy and expressive ways to develop executable formal definitions of languages. Among other benefits, these can be analyzed by automated tools. The Maude Project is one of the main groups looking into rewriting logic specifications. The paper Twenty Years of Rewriting Logic is a good overview and summary.

BETA is an object-oriented programming language that numbers Simula-67 and Smalltalk among its spiritual ancestors. A 1993 book is now available in PDF: Object-Oriented Programming in the Beta Programming Language.

A monster regex to validate email addresses

RFC 2822 – Internet Message Format (which obsoleted/replaced RFC 822, but consider them the same thing). It specifies the format of an Internet Message, which to the rest of us is an email message.

One piece of RFC 2822 specifies the format of email addresses. Someone wrote a Perl regex to validate email addresses according to RFC 822. It is nightmarish

Mail::RFC822::Address: regexp-based address validation

I’ll reproduce 10% of it here.

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:

Yes, the full regex is 82 lines long. And it doesn’t even handle the full spec: it does not handle comments in email addresses (comments can be nested, and a regex is an expression of a regular grammar, which can’t count, so it can’t handle nesting). This was not generated by hand, of course; it was generated by Perl code combining multiple simpler regular expressions together.
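
To make the nesting point concrete: recognizing an RFC 2822 comment needs a counter (or a recursive rule), which a finite automaton doesn’t have. A minimal Python sketch of the counting approach (ignoring quoted-pair escapes for brevity):

def is_comment(s):
    # an RFC 2822 comment is parenthesized and may nest, e.g. "(a (b) c)"
    if not (s.startswith("(") and s.endswith(")")):
        return False
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

is_comment("(a (nested) comment)")    # True
is_comment("(unbalanced (comment)")   # False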

By comparison, the grammar for email addresses, including the bits of the foundational grammar, is about 30 lines, and readable. The actual email address grammar is 17 lines of ABNF, and it looks like there are around 15 lines of base elements (like “comment” and “FWS”, folding white space).

So why do people use regular expressions? Because, with current tools, it’s faster to write them, and they execute faster too. The “time to write” factor is so severe that you see people using ad-hoc (and broken) regular expression parsing even in areas where a real parser would be ideal.

Now, also, this is an example of getting into trouble in the first place by not having a grammar – email addresses evolved ad-hoc, and grammars were retrofitted to them. But still, the point is that the goal should be to make high-level and high-powered parsing techniques really easy to use and apply.

Parsing bibliography

Not complete, but papers that I found useful or interesting.

2011 – Automated Evaluation of Syntax Error Recovery, Maartje de Jonge and Eelco Visser. The paper develops an automated method for parsers to recover when syntax errors are encountered.

2011 – An Algorithm for Layout Preservation in Refactoring Transformations, Maartje de Jonge, Eelco Visser. The paper develops an algorithm for preserving source layout even while manipulating the program as an abstract syntax tree, allowing for more comprehensive refactoring techniques.

2011 – Growing a Language Environment with Editor Libraries, Sebastian Erdweg, Lennart C. L. Kats, Tilman Rendel, Christian Kastner, Klaus Ostermann, Eelco Visser. The paper introduces the idea of extending IDEs through editor libraries, allowing syntactic and semantic features such as syntax coloring or reference following for arbitrary languages.

2011 – Integrated Language Definition Testing: Enabling Test-Driven Language Development, Lennart C. L. Kats, Rob Vermaas, Eelco Visser. The paper promotes the idea and use of a language-parametric testing language (LPTL) to do reusable and generic tests of language definitions. Basically unit tests and test-driven development for language design and implementation.

2010 – Pure and Declarative Syntax Definition: Paradise Lost and Regained, Lennart C. L. Kats, Eelco Visser, Guido Wachsmuth. The paper advocates for scannerless parsing, for ambiguity, for grammars written as naturally as possible, and thus for generalized parsing to be used.

2010 – The Spoofax Language Workbench, Lennart C. L. Kats, Eelco Visser. Spoofax is a language workbench based on Eclipse for developing DSLs, providing a comprehensive environment that integrates syntax definition, program transformation, code generation, and declarative specification of IDE components.

2010 – The Spoofax Language Workbench: Rules for Declarative Specification of Languages and IDEs, Lennart C. L. Kats, Eelco Visser. Implementation details of Spoofax.

Software Engineering Research Group Technical Reports