LL Parsing

Again, write some stuff.

LL parsing’s appeal is largely because you can hand-write an LL parser, even for a complex language like C++ (it’s definitely not practical and maybe not even possible to hand-write an LR parser for most languages). Now, even though it’s intuitive, the intuition is I think wrong, because the parsing algorithm in your brain (for reading English, for example) is almost certainly not top-down (insert references to recent experiments). But it certainly is simpler, and one reason it’s simpler is that LL grammars are a subset of LR grammars (which are in turn a subset of all context-free grammars).

More specifically, if you have an EBNF grammar, you can almost write the LL parser from the grammar just by turning non-terminals into function calls, and terminals into matching against the input.

Another appeal for LL parsing is that it’s easier to do useful compiler errors and error recovery. This may again be due to not fully understanding LR parsing in terms of errors. And this is probably due to the bias against ambiguous grammars (mathematicians dislike ambiguity).

Fisher and LeBlanc “Crafting a Compiler” has a good summary of LL versus LR.

http://cs.stackexchange.com/questions/43/language-theoretic-comparison-of-ll-and-lr-grammars has some notes about theoretic bounds of complexity for LL and LR grammars. LL is a subset of LR(1).

blacc – the Burk Labs Anachronistic Compiler Compiler is “yet another parser generator”. This one is pretty small, has some good ideas, and is LL(1) with extensions (e.g. does left corner transformations so that left recursion handling is algorithmic instead of ad-hoc).

SID (from the TenDRA project) is another LL(1) parser generator.

Of course, the most famous LL parser generator is ANTLR (originally PCCTS), which is on its fourth major redesign. And Terence Parr is a vociferous defender of LL parsing: http://compilers.iecc.com/comparch/article/92-05-059 and http://www.antlr.org/article/needlook.html.



GLL Parsing

GLL Parsing is Generalized LL Parsing, which is a generalized parser using LL parsing (left-to-right input, leftmost derivation). GLL parsing allows left-recursion. LL has several advantages over LR (easier to understand, can be written by a person, can be implemented with recursive descent). However, LL by itself is fairly limited, it doesn’t allow left-recursion (see Limitations of LL(1) Parsing).

What looks like a fundamental paper came out in 2010 from Elizabeth Scott and Adrian Johnstone: GLL Parsing

Recursive Descent (RD) parsers are popular because their control flow follows the structure of the grammar and hence they are easy to write and to debug. However, the class of grammars which admit RD parsers is very limited. Backtracking techniques may be used to extend this class, but can have explosive runtimes and cannot deal with grammars with left recursion. Tomita-style RNGLR parsers are fully general but are based on LR techniques and do not have the direct relationship with the grammar that an RD parser has. We develop the fully general GLL parsing technique which is recursive descent-like, and has the property that the parse follows closely the structure of the grammar rules, but uses RNGLR-like machinery to handle non determinism. The resulting recognisers run in worst-case cubic time and can be built even for left recursive grammars.

(the entire proceedings of the conference this paper was introduced at: Preliminary Proceedings of the Ninth Workshop on Language Descriptions Tools and Applications LDTA 2009 – page 113).

Another recent thesis uses GLL parsers to facilitate the design and development of domain specific languages (DSLs):

Language embedding (extending a language by embedding another language within it) is one big arena for generalized parsing techniques.

Daniel Spiewak has started doing work with GLL parsers – see his series of articles

Laurie Tratt has been writing about parsing lately. He wrote a very quoted paper (“Parsing: The Solved Problem that Isn’t” in 2011) saying that grammar composition is important, that ambiguity is important. He also wrote a paper showing how to do left-recursion safely in PEG parsing, even though he seems to think that PEG is not the way to go.

Ben Basten wrote a paper on ambiguity detection methods for context-free grammars, which is useful once you start using generalized parsing techniques (since it is easier to end up with unwanted ambiguities in your grammar). LL and LR don’t tolerate ambiguity, which is both good and bad – it’s a straightjacket, but it’s convenient. There’s probably some proof showing that good grammars are ambiguous, but we only want certain kinds of ambiguity, the kinds that lead to more expressiveness than the confusion they can introduce.

Revision control systems

This is more about theory and implementation of revision control systems, and not really about use. I’m interested in various concepts came into being and how they evolved.

One early system (that died quite thoroughly, evidently it never got used much) was OpenCM, billed as a “secure and safe CVS replacement”. I found a copy of a user’s manual from 2002 that describes some of the concepts (I’ll see if I can mirror it locally before it too disappears into obscurity).

OpenCM User’s Guide

It hit upon the idea of giving each file a universal name, but it does so by generating “random” names based on your machine name, so it loses some of the benefit of doing content-based names (like using the SHA-1 of the file contents as Monotone, Git and Mercurial do).

Here’s a snapshot of what existed in version-control land back in 2007, at least as far as open source went.

Appendix A. Free Version Control Systems from Producing Open-Source Software by Karl Fogel.

Here’s a slideshow history of revision control: http://www.win.tue.nl/~aserebre/2IS55/2009-2010/stijn.pdf. Well, somewhat – it skips 90% of revision control systems and only talks about the open-source ones. This may be appropriate, since there hasn’t been a lot of cross-pollination from the closed-source revision-control systems.

Monotone’s first release was created by Graydon Hoare and released on April 6, 2003 (according to LWN and Wikipedia). Monotone was rejected by Linus Torvalds as being too slow, and this led directly to the creation of Git.

Veracity is Eric Sink’s replacement for SourceGear. http://veracity-scm.com/

Petr Baudis’ Bachelor Thesis (2008) was on Current Concepts in Version Control Systems. He contributed a lot to Git development, starting days after Linus Torvald’s first release by building a front end for Git (git-pasky, later Cogito, and then folded some of it into the Git core).

NFTS Alternate Data Streams

NTFS Streams were introduced in Windows NT 3.1 to enable Services for Macintosh (SFM) to store Macintosh resource forks and finder information. At a technical level, the implementation was cool, because it was a generic solution that could be expanded. Microsoft did very little with it over the years, until very recently. It’s been a source of annoyance and even used as a vector for viruses. You can write an executable to an alternative data stream of any file, and then even execute it, and the stock Windows file system tools don’t really acknowledge the existence of alternative data streams: explorer.exe and others just show the size of the default stream.

Sadly, Microsoft has abandoned SFM as of Windows Server 2008, but third parties such as ExtremeZ-IP still offer support, by using alternate data streams. Mac literature for this refers to it just as “named streams”.

Mac OS X v10.5 and up writes Mac metadata and resource forks to named streams now, instead of using AppleDouble ._<filename> files. You can enable or disable this as a default, or per mount point.

There is one bug I’ve seen so far – if you’re copying a file with no data fork, this confuses Windows, because it doesn’t try to create the default stream before attaching the alternative data stream. The one case where I’ve seen this happen is in aliases copied to netatalk servers. The netatalk server stores its data using .AppleDouble folders, because it  expects to run on file systems that lack support for multiple streams.


Resource forks


NTFS Alternate data streams


AppleSingle and AppleDouble file formats

AppleSingle and AppleDouble are methods to save the extra Mac metadata on systems that otherwise could not (SMB, NFS, NTFS, ext2, etc). AppleSingle and AppleDouble are the same file format; the only difference is that the data fork is stored as a separate file in AppleDouble and included in the AppleSingle file. As such, AppleSingle is more intended for archive or transmission, whereas AppleDouble allows for non-Mac applications to easily access the data fork, ignoring the metadata. Since most applications store their file’s data in the data fork, especially for anything considered to be cross-platform, this means AppleDouble is a fairly useful system.

There are two conventions for AppleDouble files’ naming. The data fork is generally stored with the name of the file itself. The metadata is stored either in a folder named .AppleDouble parallel with the file (and the metadata itself named the same as the file inside that folder) or the metadata is stored in a file name prefixed with “._” next to the file itself. In the .AppleDouble folder case, the .AppleDouble folders also generally have the .DS_Store file (contains directory metadata unique to a Mac) and a .Parent file, which is also a AppleSingle-format file containing metadata about the parent directory itself (name, dates, icon, finder info). Interestingly, the .DS_Store file is also a AppleSingle-format file, and I’m not sure why there are both. Time to look in the source?

Here’s the appledouble.h header file from opensource.apple.com (pretty sure I can reproduce this in full since it’s under Apple’s open-source license):

/* Information pulled from:
 * "AppleSingle/AppleDouble Formats for Foreign Files Developer's Note"
 * (c) Apple Computer 1990
 * File assembled by Rob Braun (bbraun@synack.net)

#ifndef __APPLEDOUBLE__
#define __APPLEDOUBLE__

#include <sys/types.h>

/* Structure of an AppleSingle file:
 *   ----------------------
 *   | AppleSingleHeader  |
 *   |--------------------|
 *   | ASH.entries # of   |
 *   | AppleSingleEntry   |
 *   | Descriptors        |
 *   |         1          |
 *   |         .          |
 *   |         .          |
 *   |         n          |
 *   |--------------------|
 *   |   Datablock 1      |
 *   |--------------------|
 *   |   Datablock 2      |
 *   |--------------------|
 *   |   Datablock n      |
 *   ----------------------

struct AppleSingleHeader {
	uint32_t     magic;       /* Magic Number (0x00051600 for AS) */
	uint32_t     version;     /* Version #.  0x00020000 */
	char         filler[16];  /* All zeros */
	uint16_t     entries;     /* Number of entries in the file */

#define XAR_ASH_SIZE 26   /* sizeof(struct AppleSingleHeader) will be wrong
                           * due to padding. */

#define APPLESINGLE_MAGIC 0x00051600
#define APPLEDOUBLE_MAGIC 0x00051607

#define APPLESINGLE_VERSION 0x00020000
#define APPLEDOUBLE_VERSION 0x00020000

struct AppleSingleEntry {
	uint32_t     entry_id;    /* What the entry is.  See defines below */
	uint32_t     offset;      /* offset of data, offset beginning of file */
	uint32_t     length;      /* length of data.  can be 0 */

/* Valid entry_id values */
/* Entries 1, 3, and 8 are typically created for all files.
 * Macintosh Icon entries are rare, since those are typically in the resource 
 * fork.
#define AS_ID_DATA       1  /* Data fork */
#define AS_ID_RESOURCE   2  /* Resource fork */
#define AS_ID_NAME       3  /* Name of the file */
#define AS_ID_COMMENT    4  /* Standard Macintosh comment */
#define AS_ID_BWICON     5  /* Standard Macintosh B&W icon */
#define AS_ID_COLORICON  6  /* Standard Macintosh Color icon */
/* There is no 7 */
#define AS_ID_DATES      8  /* File creation date, modification date, etc. */
#define AS_ID_FINDER     9  /* Finder Information */
#define AS_ID_MAC       10  /* Macintosh File information, attributes, etc. */
#define AS_ID_PRODOS    11  /* ProDOS file information */
#define AS_ID_MSDOS     12  /* MS-DOS file information */
#define AS_ID_SHORTNAME 13  /* AFP short name */
#define AS_ID_AFPINFO   14  /* AFP file information */
#define AS_ID_AFPDIR    15  /* AFP directory id */
/* 1-0x7FFFFFFF are reserved by Apple */

/* File Dates are stored as the # of seconds before or after
 * 12am Jan 1, 2000 GMT.  The default value is 0x80000000.
struct MacTimes {
	uint32_t  creation;
	uint32_t  modification;
	uint32_t  backup;
	uint32_t  access;

/* Finder Information is two 16 byte quantities. 
 * Newly created files have all 0's in both entries.

/* Macintosh File Info entry (10) a 32 bit bitmask. */

/* Entries can be placed in any order, although Apple recommends:
 * Place the data block (1) last.
 * Finder Info, File Dates Info, and Macintosh File Info first.
 * Allocate resource for entries in 4K blocks.

/* AppleDouble files are simply AppleSingle files without the data fork.
 * The magic number is different as a read optimization. 

#endif /* __APPLEDOUBLE__ */

The old tech note describing the AppleSingle and AppleDouble file formats is AppleSingle/AppleDouble Formats: Developer’s Note, but the URL has moved several times, so I’ll try to not rely on it. And this should all maybe go in Wikipedia?

I’m going to write a quick AppleSingle disassembler. Here it is

# =======================================================================================
# dump-applesingle.pl
# disassemble an AppleSingle file to text.
# Maybe I should call this dump-appledouble since that's the more likely file to
# encounter?
# =======================================================================================

use 5.014; # also does strict
use warnings;
use feature ':5.14'; # I think this is redundant?

use Getopt::Long qw();


# =======================================================================================
# =======================================================================================

package Dump;
use parent -norequire, 'Object';

# ---------------------------------------------------------------------------------------

sub run
    my ($self) = @_;

    $self->{'option'} = Option->new()->read_options();
    $self->{'diag'} = Diag->new()->init($self->{'option'}->{'verbose'});


sub dump
    my ($self) = @_;

    foreach my $file (@{$self->{'option'}->{'files'}})
        $self->{'file'} = $file;
        $self->{'diag'}->status("Parsing $file\n");
        local $/ = undef;
        open FILE, "{'blob'} = ;
        close FILE;

sub dump_blob
    my ($self) = @_;

    my $header = substr($self->{'blob'}, 0, 26);
    my @header = unpack("L> L> x[16] S>", $header);
    my $magic = $header[0] == 0x51600 ? "APPLESINGLE_MAGIC"
                : $header[0] == 0x51607 ? "APPLEDOUBLE_MAGIC"
                : "** ERROR **";
    my $version = $header[1] == 0x20000 ? "APPLESINGLE_VERSION" : "** ERROR **";

    my $blob_length = length($self->{'blob'});
    print "\n$self->{'file'}\nlength=$blob_length\n";
    print         "      AppleSingleHeader\n";
    print         "      {\n";
    print sprintf("%04X:     uint32_t  magic      = 0x%08X  $magic\n", 0, $header[0]);
    print sprintf("%04X:     uint32_t  version    = 0x%08X  $version\n", 4, $header[1]);
    print sprintf("%04X:     char      filler[16] = {0}\n", 8);
    print sprintf("%04X:     uint16_t  entries    = %d\n", 24, $header[2]);
    print         "      }\n";

    my $count = $header[2];
    my @descriptors = unpack("(L> L> L>)[$count]", substr($self->{'blob'}, 26, $count * 12));

    # some of these are from the netatalk source, their adouble.h file

    my @annotated_offsets;
    print "\n";
    foreach my $i (0 .. $count-1)
        print "        entry[$i] =\n";
        print "        {\n";

        my $addr = 26 + 12 * $i;
        my $entry_id = $descriptors[3*$i + 0];
        my $entry_id_str = ($entry_id == 1) ? " (AS_ID_DATA)"
                            : ($entry_id == 2) ? " (AS_ID_RESOURCE)"
                            : ($entry_id == 3) ? " (AS_ID_NAME)"
                            : ($entry_id == 4) ? " (AS_ID_COMMENT)"
                            : ($entry_id == 5) ? " (AS_ID_BWICON)"
                            : ($entry_id == 6) ? " (AS_ID_COLORICON)"
                            : ($entry_id == 8) ? " (AS_ID_DATES)"
                            : ($entry_id == 9) ? " (AS_ID_FINDER)"
                            : ($entry_id == 10) ? " (AS_ID_MAC)"
                            : ($entry_id == 11) ? " (AS_ID_PRODOS)"
                            : ($entry_id == 12) ? " (AS_ID_MSDOS)"
                            : ($entry_id == 13) ? " (AS_ID_SHORTNAME)"
                            : ($entry_id == 14) ? " (AS_ID_AFPINFO)"
                            : ($entry_id == 15) ? " (AS_ID_AFPDIR)"
                            : ($entry_id == 0x80444556) ? " (AD_DEV - netatalk)"
                            : ($entry_id == 0x80494E4F) ? " (AD_INO - netatalk)"
                            : ($entry_id == 0x8053594E) ? " (AD_SYN - netatalk)"
                            : ($entry_id == 0x8053567E) ? " (AD_ID - netatalk)" : "";
        my $offset = $descriptors[3*$i + 1];
        my $length = $descriptors[3*$i + 2];

        my $entry_val_and_label = ($entry_id < 16)                 ? sprintf("%d$entry_id_str", $entry_id)                 : sprintf("0x%08X$entry_id_str", $entry_id);                  push @annotated_offsets, [$offset, $length, $entry_val_and_label];                  print sprintf("%04X:       uint32_t  entry_id = $entry_val_and_label\n", $addr + 0);         print sprintf("%04X:       uint32_t  offset   = 0x%08X\n", $addr + 4, $offset);         print sprintf("%04X:       uint32_t  length   = %d\n", $addr + 8, $length);         print "        }\n";     }          my @sorted_offsets = sort { $a->[0]  $b->[0]} @annotated_offsets;
    my $pos = 26 + $count * 12;

    my $bytes = 0;
    my $need_newline = 0;
    while ($pos < $blob_length)     {         # if we have run into one of the objects, print its name         if ($pos == $sorted_offsets[0]->[0])
            print "\n" if $need_newline;
            $need_newline = 0;
            $bytes = 0;
            print sprintf("\n        $sorted_offsets[0]->[2]  length = %d\n",  $sorted_offsets[0]->[1]);
            shift @sorted_offsets;

        if ($bytes == 16) { print "\n"; $need_newline = 0; $bytes = 0; }
        print sprintf("%04X:", $pos) unless $need_newline;
        print sprintf(" %02X", unpack("C", substr($self->{'blob'}, $pos, 1)));
        $need_newline = 1;

        $bytes += 1;
        $pos += 1;

# =======================================================================================
# Diag - diagnostic output
# char, line: print if verbose >= 3 (-v -v)
# status: print if verbose >= 2 (-v)
# progress: print if verbose > 0 (no option)
# =======================================================================================

package Diag;
use parent -norequire, 'Object';

sub init
    my $self = shift;
    $self->{'diag'} = $_[0] // 0;
    $self->{'need_newline'} = 0;
    $self->{'last_progress'} = 0;
    return $self;

sub char
    my ($self, $char) = @_;
    return unless $self->{'diag'} > 2;

    print STDERR $char;

sub line
    my ($self, $line) = @_;
    return unless $self->{'diag'} > 2;

    print STDERR "\n" and $self->{'need_newline'} = 0 if $self->{'need_newline'};
    print STDERR $line;
    $self->{'need_newline'} = 1 unless substr($line, -1) eq "\n";

sub status
    my ($self, $line) = @_;
    return unless $self->{'diag'} > 1;

    print STDERR "\n" and $self->{'need_newline'} = 0 if $self->{'need_newline'};
    print STDERR $line;
    $self->{'need_newline'} = 1 unless substr($line, -1) eq "\n";

# =======================================================================================
# Option - parse command-line options
# =======================================================================================

package Option;
use parent -norequire, 'Object';

sub read_options
    my ($self) = @_;

    my @opts = ('quiet|q', 'verbose|v+');
    $self->{'verbose'} = 1;

    my $status = Getopt::Long::GetOptions($self, @opts);
    @{$self->{'files'}} = @ARGV; # anything left is a file operand
    $self->{'verbose'} = 0 if $self->{'quiet'};

    if ($self->{'verbose'} > 2)
        print STDERR "options:\n";
        map { print STDERR "  $_ = $self->{$_}\n"; } sort keys %$self;

    return $self;

# =======================================================================================
# Object - base class
# =======================================================================================

package Object;

sub new
    my $self = shift;
    my ($class) = ref($self) || $self; # allow both virtual (member) and static (class)
    $self = {};
    bless $self, $class;

    return $self;

# =======================================================================================

=head1 NAME

dump-applesingle: disassemble AppleSingle/AppleDouble files


dump-applesingle [options] [args]

dump-applesingle -v ._appledoublefile


dump-applesingle disassembles AppleSingle and AppleDouble files into a textual
representation. Currently that's all it does, but it will probably evolve into
something to do bulk operations on them (find, change, delete, create etc)


I guess I need to (1) get some WordPress plugins to format code better and (2) change my theme so that people can view code better (narrow margins are good for text but not quiet as good for code).

More links

Windows 7 tweaks

Enable administrative shares on Windows 7

Add this registry key. This is formatted as a reg key, so you can save it to a file “localaccount.reg” and just run it from the shell.

Windows Registry Editor Version 5.00


This enables shares of the form C$, D$ etc. This assumes that “File and Printer Sharing” is enabled (you’ll find this in the Control Panel : Network and Internet : Network and Sharing Center : Advanced sharing settings”).


Western Digital MyBook Live tweaks

The MyBook Live series is based around Linux, so you can fix some of WD’s design decisions and improve performance. You can do just about anything, but there are some simple performance improvements you can do.

Enable SSH

Assuming you are blocking port 22 from outside your network, enable SSH by logging in to the web dashboard, then switch to the page UI/SSH, and enable from there. If your NAS had the address, you would first go to

log in, then go to

to enable SSH. At that point, you can ssh to your box; ssh is built into Mac and Linux boxes, but you’ll need to use Putty or some other client on Windows.

I got this from What are the steps to enable SSH from the WD Community forums.

Fix monitorio.sh

WD has a monitor script as a daemon. That in itself is fine, but it periodically does an ls of the entire drive to build up some simple statistics. Once your drive has a large number of files on it, this will cripple performance. Since I don’t ever look at their stats, I prefer to disable this part of monitorio.sh. I want to repeat this: if you do this, then the web front end will no longer show accurate disk space usage (it will show whatever you had up to the point where you nerf the file_tally function).

I suggest the following: ssh to your NAS, edit /usr/local/sbin/monitorio.sh, rename file_tally to file_tally_old, and insert an empty file_tally function. Your file will look like this:

file_tally() {}

file_tally_old() {
        if [ ! -p $TALLY_PIPE]; then

You’ll want to restart the monitorio daemon (or reboot the NAS, which is more dramatic).

/etc/init.d/monitorio restart

You can also use ps and kill to stop any existing ls process if you’re impatient.

I didn’t figure this out, I got it from Solving the MyBook Live insane load.

Change to CFQ scheduler

Some people think that performance is better if you switch to the CFQ scheduler. This will definitely depend on how you use the NAS. WD has defaulted to the Anticipatory scheduler. You can check to see what yours is set at.

cd /sys/block/sda/queue
cat scheduler

and if you see

noop [anticipatory] deadline cfq

then your NAS is currently using the Anticipatory scheduler. You can switch by writing CFQ to the scheduler file like this

echo cfq >scheduler

And, if you have multiple drives in your WD Live box (for example, a Live Duo), you’ll need to do this for sdb as well.

I got this from Performance problems? Read this first.

You can read more about the various scheduling algorithms in this Red Hat page: Choosing an I/O scheduler for Red Hat Enterprise Linux 4 and the 2.6 Kernel. At some point, this will be out of date, but it was valid as of late 2012.