A monster regex to validate email addresses

RFC 2822, Internet Message Format, obsoleted and replaced RFC 822; for this discussion, consider them the same thing. It specifies the format of an Internet Message, which to the rest of us is an email message.

One piece of RFC 2822 specifies the format of email addresses. Someone wrote a Perl regex to validate email addresses according to RFC 822. It is nightmarish:

Mail::RFC822::Address: regexp-based address validation

I’ll reproduce 10% of it here.

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:

Yes, the full regex is 82 lines long, and it still doesn’t handle the full spec: it does not handle comments in email addresses. Comments can be nested, and a regex expresses a regular grammar, which can’t count and so can’t handle nesting. Of course, this was not written by hand; it was generated by Perl code that combines multiple simpler regular expressions.
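To make the “can’t count” point concrete, here is a minimal Python sketch of skipping a nested comment by keeping a depth counter, which is exactly the bit of state a regular grammar lacks. The function name `skip_comment` is made up for illustration, and the rules are simplified from the RFC: parentheses nest, a backslash escapes the next character, and nothing else is checked.

```python
def skip_comment(text: str, pos: int = 0) -> int:
    """Return the index just past the comment that starts at text[pos].

    Simplified assumption: parentheses may nest and a backslash escapes the
    next character. Raises ValueError if the comment is malformed.
    """
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("comment must start with '('")
    depth = 0
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1  # one level deeper
        elif ch == ")":
            depth -= 1  # one level back out
            if depth == 0:
                return i + 1  # the outermost comment just closed
        i += 1
    raise ValueError("unterminated comment")
```

Called on `"(a (nested) comment) rest"`, it returns the index just past the outer closing parenthesis, so slicing from there leaves `" rest"`.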

By comparison, the grammar for email addresses, including the bits of the foundational grammar, is about 30 lines, and readable. The actual email address grammar is 17 lines of ABNF, and it looks like there are around 15 lines of base elements (like “comment” and “FWS”, folding white space).
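As a rough illustration of how directly a grammar can map onto code, here is a Python sketch in which each function mirrors one rule. It is a heavy simplification, not the RFC grammar: it only accepts dot-atom "@" dot-atom, with no quoted strings, domain literals, comments, or folding white space. The function names are hypothetical.

```python
import string

# atext: letters, digits, and the punctuation RFC 2822 allows in an atom.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")

def parse_atom(s, i):
    """atom (simplified): one or more atext characters. Returns the new index."""
    start = i
    while i < len(s) and s[i] in ATEXT:
        i += 1
    if i == start:
        raise ValueError(f"expected atom at position {start}")
    return i

def parse_dot_atom(s, i):
    """dot-atom (simplified): atom *("." atom)."""
    i = parse_atom(s, i)
    while i < len(s) and s[i] == ".":
        i = parse_atom(s, i + 1)
    return i

def parse_addr_spec(s):
    """addr-spec (simplified): dot-atom "@" dot-atom. Returns (local, domain)."""
    at = parse_dot_atom(s, 0)
    if at >= len(s) or s[at] != "@":
        raise ValueError(f"expected '@' at position {at}")
    end = parse_dot_atom(s, at + 1)
    if end != len(s):
        raise ValueError(f"unexpected character at position {end}")
    return s[:at], s[at + 1:]
```

For example, `parse_addr_spec("jane.doe@example.com")` returns the local part and the domain as a pair, while `"a..b@example.com"` is rejected because an atom must follow each dot.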

So why do people use regular expressions? Because, with current tools, it’s faster to write them, and they execute faster too. The “time to write” factor is so severe that you see people using ad-hoc (and broken) regular expression parsing even in areas where a real parser would be ideal.

This is also an example of getting into trouble in the first place by not having a grammar: email addresses evolved ad hoc, and grammars were retrofitted to them. Still, the point is that the goal should be to make high-level, high-powered parsing techniques really easy to use and apply.
