A monster regex to validate email addresses

RFC 2822, Internet Message Format, obsoleted and replaced RFC 822 (for our purposes, consider them the same thing). It specifies the format of an Internet Message, which to the rest of us is an email message.

One piece of RFC 2822 specifies the format of email addresses. Someone wrote a Perl regex to validate email addresses according to RFC 822. It is nightmarish.

Mail::RFC822::Address: regexp-based address validation

I’ll reproduce 10% of it here.

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:

Yes, the full regex is 82 lines long. And it still doesn’t handle the full spec: it does not handle comments in email addresses (comments can be nested, and a regex is an expression of a regular grammar, which can’t count and so can’t handle nesting). This was not written by hand, of course; it was generated by Perl code that combines multiple simpler regular expressions.
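The nesting limitation is easy to demonstrate. Here is a minimal sketch in Python (my own illustration, not from the original post): a scanner that counts parenthesis depth accepts arbitrarily nested comments, while any single regular expression can only cope with some fixed maximum depth.

```python
def skip_comment(s, i):
    """Scan a (possibly nested) RFC 2822-style comment starting at s[i] == '('.

    Returns the index just past the matching ')'. A plain regular
    expression cannot do this for arbitrary depth, because matching
    parentheses requires counting.
    """
    assert s[i] == "("
    depth = 0
    while i < len(s):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated comment")

# Nested comments are fine for a counting scanner:
print(skip_comment("(a (b (c)) d) rest", 0))  # -> 13
```

A regex can be written to handle nesting to depth 2, or 3, or any fixed bound, but never "any depth" — that is exactly the boundary between regular and context-free grammars.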

By comparison, the grammar for email addresses, including the bits of the foundational grammar it relies on, is about 30 lines, and readable. The actual email address grammar is 17 lines of ABNF, and there look to be around 15 lines of base elements (like “comment” and “FWS”, folding white space).
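For a taste of how compact that grammar is, here is the core of the address production, paraphrased from memory of RFC 2822 (§3.4.1 and §3.2.4 — check the RFC for the authoritative text):

```
addr-spec     = local-part "@" domain
local-part    = dot-atom / quoted-string / obs-local-part
domain        = dot-atom / domain-literal / obs-domain
dot-atom      = [CFWS] dot-atom-text [CFWS]
dot-atom-text = 1*atext *("." 1*atext)
```

Five lines of ABNF carry most of the structure that the 82-line regex had to spell out by brute force.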

So why do people use regular expressions? Because, with current tools, it’s faster to write them, and they execute faster too. The “time to write” factor is so severe that you see people using ad-hoc (and broken) regular expression parsing even in areas where a real parser would be ideal.
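As a concrete illustration of that "time to write" factor, here is the kind of ad-hoc check people actually reach for (a hypothetical sketch, not from the original post). It is one line, and it silently rejects addresses the RFC allows:

```python
import re

# Deliberately naive "good enough" check: something@something.something,
# no whitespace or extra @ signs. It rejects RFC-valid addresses such as
# quoted local parts, but it takes seconds to write.
AD_HOC = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

print(bool(AD_HOC.match("user@example.com")))          # True
print(bool(AD_HOC.match('"john smith"@example.com')))  # False (RFC-valid!)
```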

Now, this is also an example of getting into trouble in the first place by not having a grammar: email addresses evolved ad hoc, and grammars were retrofitted to them. But still, the point stands: the goal should be to make high-level, high-powered parsing techniques really easy to use and apply.

Finding bugs algorithmically

Visual Patterns in Software Process Data

Visualizing bug occurrence and bug fix rates, both for better tracking and thus projection of effort and time, but also to help improve the development process.

I liked the visualization techniques, especially the visual effort estimation, which can show strong tendencies in one direction (over-estimating, under-estimating, etc.).

Predicting the Fix Time of Bugs

This paper tries to work out prediction models for bug fix times and bug fix effort, seeing if attributes of a bug report can be used to build a recommender system.

Faster Defect Resolution with Higher Technical Quality of Software

This paper investigates a possible link between technical quality of software and the rate at which defects can be resolved. Intuitively, we have talked about metrics like “small classes, low coupling, low duplication” as being components of good code, but how true is this?

Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction

Predicting bugs by fine-grained churn in code: exact changes at the statement level, rather than gross file-level or commit-level churn.


Data mining of source code repositories

Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining

This paper uses mining of a repository to answer questions about the evolution of production code and test code. It introduces a Change History View visualizing commit-behavior over time, a Growth History View showing relative growth of code over time, and a Test Coverage Evolution View showing test coverage of a system over time.

Identifying Cross-Cutting Concerns Using Software Repository Mining

This paper uses repository mining to find cross-cutting concerns, pieces of functionality not captured into separate modules. Files that are changed together can indicate interdependent code.

Measuring Library Stability Through Historical Version Analysis

This paper introduces a way to measure interface and implementation stability of libraries.

Evaluating the Lifespan of Code Smells using Software Repository Mining

This paper investigates the lifespan of code smells, finding that, at least in the investigated open source systems, engineers are aware of code smells but not very concerned with their impact. In this paper, code smells are treated as synonymous with anti-patterns at the programming level.

Identifying Changed Source Code Lines from Version Repositories

Revision repositories usually provide information about differences as insertions and deletions, making it very hard to track the evolution of individual source lines. This paper develops a method to infer source changes from diffs.


Software Development Environments on the Web

This is a thoughtful review and projection of where software development might go, given that many work modes are migrating partially or in whole to the web.

Software Development Environments on the Web: A Research Agenda

However, one piece missing here is a more thorough exploration of rich web clients, which really don’t exist at the moment. Why don’t they exist? Because they are currently hard to write. However, they are higher-performing than lightweight clients, and always will be. So how can we make rich clients easier to write?


Parsing bibliography

Not complete, but papers that I found useful or interesting.

2011 – Automated Evaluation of Syntax Error Recovery, Maartje de Jonge and Eelco Visser. The paper develops an automated method for parsers to recover when syntax errors are encountered.

2011 – An Algorithm for Layout Preservation in Refactoring Transformations, Maartje de Jonge, Eelco Visser. The paper develops an algorithm for preserving source layout even while manipulating the program as an abstract syntax tree, allowing for more comprehensive refactoring techniques.

2011 – Growing a Language Environment with Editor Libraries, Sebastian Erdweg, Lennart C. L. Kats, Tilman Rendel, Christian Kastner, Klaus Ostermann, Eelco Visser. The paper introduces the idea of extending IDEs through editor libraries, allowing syntactic and semantic features such as syntax coloring or reference following for arbitrary languages.

2011 – Integrated Language Definition Testing Enabling Test-Driven Language Development, Lennart C. L. Kats, Rob Vermaas, Eelco Visser. The paper promotes the idea and use of a language-parametric testing language (LPTL) to do reusable and generic tests of language definitions. Basically unit tests and test-driven-development for language design and implementation.

2010 – Pure and Declarative Syntax Definition: Paradise Lost and Regained, Lennart C. L. Kats, Eelco Visser, Guido Wachsmuth. The paper advocates for scannerless parsing, for ambiguity, for grammars written as naturally as possible, and thus for generalized parsing to be used.

2010 – The Spoofax Language Workbench, Lennart C. L. Kats, Eelco Visser. Spoofax is a language workbench based on Eclipse for developing DSLs, providing a comprehensive environment that integrates syntax definition, program transformation, code generation, and declarative specification of IDE components.

2010 – The Spoofax Language Workbench: Rules for Declarative Specification of Languages and IDEs, Lennart C. L. Kats, Eelco Visser. Implementation details of Spoofax.

Software Engineering Research Group Technical Reports